What will happen when one thread is closing a searcher while another is searching?

2011-09-05 Thread Li Li
hi all,
 I am using spellcheck in Solr 1.4. I found that the spell checker does not manage its searcher the way SolrCore does: SolrCore uses reference counting to track the current searcher, so the old searcher and the new searcher can both exist while the old one is still servicing queries. But in FileBasedSpellChecker:

  public void build(SolrCore core, SolrIndexSearcher searcher) {
    try {
      loadExternalFileDictionary(core.getSchema(), core.getResourceLoader());
      spellChecker.clearIndex();
      spellChecker.indexDictionary(dictionary);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // clearIndex() in the underlying Lucene SpellChecker that the Solr spell checker delegates to:
  public void clearIndex() throws IOException {
    IndexWriter writer = new IndexWriter(spellIndex, null, true);
    writer.close();

    // close the old searcher
    searcher.close();
    searcher = new IndexSearcher(this.spellIndex);
  }

  clearIndex() wipes the old index and closes the current searcher. If another thread is doing a search when searcher.close() is called, will that cause a problem? Also, if searcher.close() has finished but the new IndexSearcher has not yet been constructed, will another thread that tries to search at that moment also run into trouble?


Re: Multi CPU Cores

2011-10-16 Thread Li Li
For indexing, you can make use of multiple CPU cores easily by calling IndexWriter.addDocument from multiple threads (a rough sketch follows).
As far as I know, for searching, a single request cannot make good use of multiple CPUs.
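
Not an official recipe, just a minimal sketch of multi-threaded feeding with the Lucene 3.x API; the index path, field name and analyzer below are placeholders, and IndexWriter itself is thread-safe:

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ParallelIndexer {
  public static void index(List<String> texts, int threads) throws Exception {
    // one shared writer; addDocument may be called concurrently
    final IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/path/to/index")),   // placeholder path
        new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34)));
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (final String text : texts) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            Document doc = new Document();
            doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED)); // placeholder field
            writer.addDocument(doc);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    writer.close(); // commits everything that was added
  }
}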

On Sat, Oct 15, 2011 at 9:37 PM, Rob Brown  wrote:

> Hi,
>
> I'm running Solr on a machine with 16 CPU cores, yet watching "top" shows
> that java is only apparently using 1 and maxing it out.
>
> Is there anything that can be done to take advantage of more CPU cores?
>
> Solr 3.4 under Tomcat
>
> [root@solr01 ~]# java -version
> java version "1.6.0_20"
> OpenJDK Runtime Environment (IcedTea6 1.9.8) (rhel-1.22.1.9.8.el5_6-x86_64)
> OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
>
>
> top - 14:36:18 up 22 days, 21:54,  4 users,  load average: 1.89, 1.24, 1.08
> Tasks: 317 total,   1 running, 315 sleeping,   0 stopped,   1 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.6%id,  0.4%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu6  : 99.6%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu13 :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
>  0.0%st
> Mem:  132088928k total, 23760584k used, 108328344k free,   318228k buffers
> Swap: 25920868k total,0k used, 25920868k free, 18371128k cached
>
>  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  4466 tomcat   20   0 31.2g 4.0g 171m S 101.0  3.2   2909:38 java
>  6495 root  15   0 42416 3892 1740 S  0.4  0.0   9:34.71 openvpn
> 11456 root  16   0 12892 1312  836 R  0.4  0.0   0:00.08 top
>1 root  15   0 10368  632  536 S  0.0  0.0   0:04.69 init
>
>


Re: Want to support "did you mean xxx" but is Chinese

2011-10-21 Thread Li Li
We have implemented "did you mean" and prefix suggestion for Chinese. But we based our work on Solr 1.4 and made many modifications, so it would take time to integrate it into current Solr/Lucene.

 Here is our solution; we would be glad to hear any advice.

 1. offline word and phrase discovery
   We discover new words and new phrases by mining query logs.

 2. online matching algorithm
   For each word, e.g. 贝多芬, we convert it to pinyin, "bei duo fen", and then index it with character n-grams, which gives tokens like gram3:bei gram3:eid ...
   To get the "did you mean" result we convert the query 背朵分 into n-grams as well. It is a boolean OR query, so there are many results (the words whose pinyin is similar to the query are ranked on top).
   Then we rerank the top 500 results with a fine-grained algorithm: we use edit distance to align the query and each candidate, and we also take the characters themselves into consideration. E.g. for the query 十度, the matches 十渡 and 是度 have exactly the same pinyin, but 十渡 is better than 是度 because 十 occurs in both the query and the match.
   You also need to consider the hotness (popularity) of the different words/phrases, which can be learned from query logs.

   Another issue is converting Chinese into pinyin, because some characters have more than one pinyin.
 E.g. 长沙 vs. 长大: 长's pinyin is "chang" in 长沙, so you should segment the query and the words/phrases first. Word segmentation is a basic problem in Chinese IR. (A small sketch of the n-gram step is below.)

2011/10/21 Floyd Wu 

> Does anybody know how to implement this idea in SOLR. Please kindly
> point me a direction.
>
> For example, when user enter a keyword in Chinese "貝多芬" (this is
> Beethoven in Chinese)
> but key in a wrong combination of characters  "背多分" (this is
> pronouncation the same with previous keyword "貝多芬").
>
> There in solr index exist token "貝多芬" actually. How to hit documents
> where "貝多芬" exist when "背多分" is enter.
>
> This is basic function of commercial search engine especially in
> Chinese processing. I wonder how to implements in SOLR and where is
> the start point.
>
> Floyd
>


Re: Can't find resource 'solrconfig.xml'

2011-10-31 Thread Li Li
Modify catalina.sh (or catalina.bat on Windows), adding a Java startup parameter:
-Dsolr.solr.home=/your/path

On Mon, Oct 31, 2011 at 8:30 PM, 刘浪  wrote:

> Hi,
>  After I start tomcat, I input http://localhost:8080/solr/admin. It
> can display. But in the tomcat, I find an exception like "Can't find
> resource 'solrconfig.xml' in classpath or 'solr\.\conf/', cwd=D:\Program
> Files (x86)\apache-tomcat-6.0.33\bin". It occurs before "Server start up
> in 1682 ms."
>  What should I do? Thank you very much.
>
>  Solr Directory: D:\Program Files (x86)\solr. It contains bin, conf,
> data, solr.xml, and README.txt.
>  Tomcat Directory: D:\Program Files (x86)\apache-tomcat-6.0.33.
>
> Sincerely,
> Amos
>


Re: RE: Can't find resource 'solrconfig.xml'

2011-10-31 Thread Li Li
set JAVA_OPTS=%JAVA_OPTS% -Dsolr.solr.home=c:\xxx

On Mon, Oct 31, 2011 at 9:14 PM, 刘浪  wrote:

> Hi Li Li,
>I don't know where I should add this in catalina.bat. I know how to do it
> on Linux, but my OS is Windows.
>Thank you very much.
>
> Sincerely,
> Amos
>
>
> this is the part of catalina.bat:
>
> rem Execute Java with the applicable properties
> if not "%JPDA%" == "" goto doJpda
> if not "%SECURITY_POLICY_FILE%" == "" goto doSecurity
> %_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %DEBUG_OPTS%
> -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%"
> -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%"
> -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
> goto end
> :doSecurity
> %_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %DEBUG_OPTS%
> -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%"
> -Djava.security.manager -Djava.security.policy=="%SECURITY_POLICY_FILE%"
> -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%"
> -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
> goto end
> :doJpda
> if not "%SECURITY_POLICY_FILE%" == "" goto doSecurityJpda
> %_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %JPDA_OPTS% %DEBUG_OPTS%
> -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%"
> -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%"
> -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
> goto end
> :doSecurityJpda
> %_EXECJAVA% %JAVA_OPTS% %CATALINA_OPTS% %JPDA_OPTS% %DEBUG_OPTS%
> -Djava.endorsed.dirs="%JAVA_ENDORSED_DIRS%" -classpath "%CLASSPATH%"
> -Djava.security.manager -Djava.security.policy=="%SECURITY_POLICY_FILE%"
> -Dcatalina.base="%CATALINA_BASE%" -Dcatalina.home="%CATALINA_HOME%"
> -Djava.io.tmpdir="%CATALINA_TMPDIR%" %MAINCLASS% %CMD_LINE_ARGS% %ACTION%
> goto end
>
> :end
>
>
>
> --
>
>
> > -----Original Message-----
> > From: "Brandon Ramirez" 
> > Sent: Monday, October 31, 2011
> > To: "solr-user@lucene.apache.org" 
> > Cc:
> > Subject: RE: Can't find resource 'solrconfig.xml'
> >
> > I have found setenv.sh to be very helpful.  It's a hook where you can
> setup environment variables and java options without modifying your
> catalina.sh script.  This makes upgrading a whole lot easier.
> >
> >
> > Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848
> > Software Engineer II | Element K | www.elementk.com
> >
> >
> > -Original Message-
> > From: Li Li [mailto:fancye...@gmail.com]
> > Sent: Monday, October 31, 2011 8:35 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Can't find resource 'solrconfig.xml'
> >
> > modify catalina.sh(bat)
> > adding java startup params:
> > -Dsolr.solr.home=/your/path
> >
> > On Mon, Oct 31, 2011 at 8:30 PM, 刘浪  wrote:
> >
> > > Hi,
> > >  After I start tomcat, I input http://localhost:8080/solr/admin.
> > > It can display. But in the tomcat, I find an exception like "Can't
> > > find resource 'solrconfig.xml' in classpath or 'solr\.\conf/',
> > > cwd=D:\Program Files (x86)\apache-tomcat-6.0.33\bin". It occurs
> > > before "Server start up in 1682 ms."
> > >  What should I do? Thank you very much.
> > >
> > >  Solr Directory: D:\Program Files (x86)\solr. It contains bin,
> > > conf, data, solr.xml, and README.txt.
> > >  Tomcat Directory: D:\Program Files (x86)\apache-tomcat-6.0.33.
> > >
> > > Sincerely,
> > > Amos
> > >
>


collapse exception

2010-06-21 Thread Li Li
It says "Either filter or filterList may be set in the QueryCommand, but not both." I am a newbie to Solr and have no idea what this exception means. What is wrong? Thank you.

java.lang.IllegalArgumentException: Either filter or filterList may be
set in the QueryCommand, but not both.
at 
org.apache.solr.search.SolrIndexSearcher$QueryCommand.setFilter(SolrIndexSearcher.java:1711)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1286)
at 
org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:205)
at 
org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:246)
at 
org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:173)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:174)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
at java.lang.Thread.run(Thread.java:619)


Re: collapse exception

2010-06-21 Thread Li Li
I don't know, because it was patched by someone else and I can't get his help. When will this component become a contrib? Using patches is so annoying.

2010/6/22 Martijn v Groningen :
> What version of Solr and which patch are you using?
>
> On 21 June 2010 11:46, Li Li  wrote:
>> it says  "Either filter or filterList may be set in the QueryCommand,
>> but not both." I am newbie of solr and have no idea of the exception.
>> What's wrong with it? thank you.
>>
>> java.lang.IllegalArgumentException: Either filter or filterList may be
>> set in the QueryCommand, but not both.
>>        at 
>> org.apache.solr.search.SolrIndexSearcher$QueryCommand.setFilter(SolrIndexSearcher.java:1711)
>>        at 
>> org.apache.solr.search.SolrIndexSearcher.getDocListAndSet(SolrIndexSearcher.java:1286)
>>        at 
>> org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:205)
>>        at 
>> org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:246)
>>        at 
>> org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:173)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:174)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
>>        at 
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203)
>>        at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>        at 
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>        at 
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>        at 
>> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>        at 
>> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>        at 
>> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>>        at 
>> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>        at 
>> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>        at 
>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
>>        at 
>> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
>>        at 
>> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>>        at 
>> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
>>        at java.lang.Thread.run(Thread.java:619)
>>
>
>
>
> --
> Met vriendelijke groet,
>
> Martijn van Groningen
>


about function query

2010-06-22 Thread Li Li
I want to integrate a document's timestamp into the search scoring, and I found an example of function queries in the book "Solr 1.4 Enterprise Search Server". I want to boost newer documents, so the function might be something like 1/(timestamp+1). But the function query's value is added to the final score, not multiplied into it, so I can't tune the parameter well.
e.g.
For term1, the top docs are doc1 with score 2.0 and doc2 with score 1.5.
For term2, the top docs are doc1 with score 20 and doc2 with score 15.
It is hard to adjust the relative ranking of these two docs by adding a value.
If it were a multiplication it would be easy: if doc1 is very old we assign it a factor of 1, and doc2 is new so we assign it 2; the total scores become 2.0*1 and 1.5*2, so doc2 ranks higher than doc1.
But with addition, 2.0 + weight*1 versus 1.5 + weight*2, it is hard to find a proper weight. If the weight is 1 it works well for term1, but for term2 it becomes 20 + 1*1 versus 15 + 1*2, and time has little influence on the final result.
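
For what it's worth, a hedged example: if your Solr version has the boost query parser and the ms() function (both should be available in 1.4), a multiplicative date boost can be written roughly like this, where timestamp is just an assumed field name:

    q={!boost b=recip(ms(NOW,timestamp),3.16e-11,1,1)}term1

recip(x,m,a,b) computes a/(m*x+b), so a very fresh document gets a factor near 1 and an old one a smaller factor, and that factor is multiplied into the relevance score instead of being added to it.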


is there a "delete all" command in updateHandler?

2010-06-27 Thread Li Li
I want to delete the whole index and rebuild it frequently. I can't
delete the index files directly because I am using replication.
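
As a hedged side note, the update handler does accept delete-by-query, so posting the following XML to /update (and then a commit) clears every document without touching the index files by hand:

    <delete><query>*:*</query></delete>
    <commit/>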


index format error because disk full

2010-07-06 Thread Li Li
The index files are ill-formed because the disk became full while feeding. Can I
roll back to the last version? Is there any way to avoid unexpected
errors like this when indexing? The attachments are my segments_N files.


How to manage resource out of index?

2010-07-06 Thread Li Li
I used to store the full text in the Lucene index. But I found merging is very
slow, because merging two segments copies their fdt files into a new one. So I
want to only index the full text, not store it. But when searching I still need
the full text for things like highlighting and viewing the document. I can store
the full text as <url, text> pairs in a database and load them into memory: when
I search in Lucene (or Solr) I retrieve the url of each doc first, then use the
url to get the full text. But when they are stored separately it is hard to keep
them consistent with each other. Does Lucene or Solr provide any way to ease
this problem? Does anyone have experience with it?


Re: index format error because disk full

2010-07-07 Thread Li Li
I used SegmentInfos to read the segments_N file and found that the error occurs
when it tries to load the deleted docs: the .del file's size is 0 (because of
the disk-full error). So I used SegmentInfos to set delGen=-1 to ignore the
deleted docs.
But I think there is a bug here. The write logic may be: it first writes the
.del file, then writes the segments_N file. But it only writes to a buffer and
doesn't flush to disk immediately, so when the disk is full it can happen that
the segments_N file is flushed but the .del file fails.

2010/7/8 Lance Norskog :
> If autocommit does not do an automatic rollback, that is a serious bug.
>
> There should be a way to detect that an automatic rollback has
> happened, but I don't know what it is. Maybe something in the Solr
> MBeans?
>
> On Wed, Jul 7, 2010 at 5:41 AM, osocurious2  wrote:
>>
>> I haven't used this myself, but Solr supports a
>> http://wiki.apache.org/solr/UpdateXmlMessages#A.22rollback.22 rollback
>> function. It is supposed to rollback to the state at the previous commit. So
>> you may want to turn off auto-commit on the index you are updating if you
>> want to control what that last commit level is.
>>
>> However, in your case if the index gets corrupted due to a disk full
>> situation, I don't know what rollback would do, if anything, to help. You
>> may need to play with the scenario to see what would happen.
>>
>> If you are using the DataImportHandler it may handle the rollback for
>> you...again, however, it may not deal with disk full situations gracefully
>> either.
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/index-format-error-because-disk-full-tp948249p948968.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Distributed Indexing

2010-07-08 Thread Li Li
Are there any tools for "Distributed Indexing"? The wiki refers to
KattaIntegration and ZooKeeperIntegration in
http://wiki.apache.org/solr/DistributedSearch.
But they seem to be more concerned with error handling and replication. I need
a dispatcher that sends different docs to different machines by uniqueKey (such
as url), so that when a doc is updated it is sent to the machine that already
contains that url. I also need the docs to be spread evenly across all the
machines, so that when I do a distributed search the idfs of the different
machines are similar, because the current distributed search uses local idf.
(A rough sketch of such a dispatcher follows.)
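
A minimal sketch of the dispatcher idea, under the assumption that hashing the uniqueKey is acceptable: the same url always maps to the same shard (so updates overwrite the old copy), and a decent hash spreads docs roughly evenly. The shard URLs are placeholders.

import java.util.List;

public class ShardDispatcher {
  private final List<String> shardUrls;  // e.g. "http://shard1:8983/solr", ... (placeholders)

  public ShardDispatcher(List<String> shardUrls) {
    this.shardUrls = shardUrls;
  }

  /** Maps a uniqueKey (such as a url) to the shard that should index it. */
  public String shardFor(String uniqueKey) {
    int h = uniqueKey.hashCode() & Integer.MAX_VALUE;  // force non-negative
    return shardUrls.get(h % shardUrls.size());
  }
}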


how to save a snapshot of an index?

2010-07-11 Thread Li Li
When I add some docs with post.jar (org.apache.solr.util.SimplePostTool), it
commits after all docs are added. That calls IndexWriter.commit(), a new segment
is added, and sometimes this triggers a segment merge. New index files are
generated (frm, tii, tis, ...), and old segments are deleted once all references
to them are closed (all the readers that opened them).
That's fine. But I want to back up a version of my index so that I can restore
it when something goes wrong. I could write a script to back up all the files in
the index directory every day, but if it runs while indexing is in progress the
script may back up the wrong files, so it would have to obtain the ***.lock file
to do things correctly. Is there any built-in tool in Solr for this? I just want
to back up the index periodically (e.g. at 0:00 every day).


Cache full text into memory

2010-07-13 Thread Li Li
 I want to cache the full text in memory to improve performance.
The full text is only used for highlighting in my application (but highlighting
is very time consuming: my average query time is about 250ms, and I guess it
would be about 50ms if I only had to fetch the top 10 full texts; it gets worse
when fetching more, because on disk they are scattered everywhere for one query).
The full text per machine is about 200GB, and the memory available for it is
about 10GB, so I want to compress it in memory. Assuming a compression ratio of
1:5, I could load about 1/4 of the full text into memory. I need a cache
component for this. Has anyone faced this problem before? I need some advice.
Is it possible to use an external tool such as memcached? Thank you.
(A rough sketch of a compressed LRU cache follows.)
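
A minimal sketch of the idea, assuming java.util.zip for compression and a LinkedHashMap-based LRU policy; a real component would track size in bytes rather than entry count and handle invalidation on updates.

import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressedLruCache {
  private final Map<String, byte[]> cache;

  public CompressedLruCache(final int maxEntries) {
    // accessOrder=true turns LinkedHashMap into an LRU structure
    this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxEntries;
      }
    };
  }

  public synchronized void put(String url, String fullText) throws Exception {
    byte[] raw = fullText.getBytes("UTF-8");
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length / 4 + 16);
    byte[] buf = new byte[8192];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));   // compress into the buffer
    }
    deflater.end();
    cache.put(url, out.toByteArray());
  }

  /** Returns the decompressed full text, or null on a cache miss. */
  public synchronized String get(String url) throws Exception {
    byte[] compressed = cache.get(url);
    if (compressed == null) return null;
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 4);
    byte[] buf = new byte[8192];
    while (!inflater.finished()) {
      out.write(buf, 0, inflater.inflate(buf));   // decompress on every hit
    }
    inflater.end();
    return new String(out.toByteArray(), "UTF-8");
  }
}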


Re: Cache full text into memory

2010-07-14 Thread Li Li
I already store it in the Lucene index, but that is on disk, and when a query
comes in it must seek the disk to get it. I am not familiar with Lucene's
caches. I just want to make full use of my memory: load about 10GB of it into
memory with an LRU strategy when the cache is full. To load more into memory I
want to compress it in memory; I don't care much about disk space, so whether
or not it's compressed inside Lucene doesn't matter.

2010/7/14 findbestopensource :
> You have two options
> 1. Store the compressed text as part of stored field in Solr.
> 2. Using external caching.
> http://www.findbestopensource.com/tagged/distributed-caching
>    You could use ehcache / Memcache / Membase.
>
> The problem with external caching is you need to synchronize the deletions
> and modification. Fetching the stored field from Solr is also faster.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
> On Wed, Jul 14, 2010 at 12:08 PM, Li Li  wrote:
>
>>     I want to cache full text into memory to improve performance.
>> Full text is only used to highlight in my application(But it's very
>> time consuming, My avg query time is about 250ms, I guess it will cost
>> about 50ms if I just get top 10 full text. Things get worse when get
>> more full text because in disk, it scatters erverywhere for a query.).
>> My full text per machine is about 200GB. The memory available for
>> store full text is about 10GB. So I want to compress it in memory.
>> Suppose compression ratio is 1:5, then I can load 1/4 full text in
>> memory. I need a Cache component for it. Has anyone faced the problem
>> before? I need some advice. Is it possbile using external tools such
>> as MemCached? Thank you.
>>
>


Re: Cache full text into memory

2010-07-14 Thread Li Li
Thank you. I don't know which cache system to use. In my application the cache
must support a compression algorithm with a high compression ratio and fast
decompression (because every time something is fetched from the cache it must
be decompressed).

2010/7/14 findbestopensource :
> I have just provided you two options. Since you already store as part of the
> index, You could try external caching. Try using ehcache / Membase
> http://www.findbestopensource.com/tagged/distributed-caching . The caching
> system will do LRU and is much more efficient.
>
> On Wed, Jul 14, 2010 at 12:39 PM, Li Li  wrote:
>
>> I have already store it in lucene index. But it is in disk and When a
>> query come, it must seek the disk to get it. I am not familiar with
>> lucene cache. I just want to fully use my memory that load 10GB of it
>> in memory and a LRU stragety when cache full. To load more into
>> memory, I want to compress it "in memory". I don't care much about
>> disk space so whether or not it's compressed in lucene .
>>
>> 2010/7/14 findbestopensource :
>>  > You have two options
>> > 1. Store the compressed text as part of stored field in Solr.
>> > 2. Using external caching.
>> > http://www.findbestopensource.com/tagged/distributed-caching
>> >    You could use ehcache / Memcache / Membase.
>> >
>> > The problem with external caching is you need to synchronize the
>> deletions
>> > and modification. Fetching the stored field from Solr is also faster.
>> >
>> > Regards
>> > Aditya
>> > www.findbestopensource.com
>> >
>> >
>> > On Wed, Jul 14, 2010 at 12:08 PM, Li Li  wrote:
>> >
>> >>     I want to cache full text into memory to improve performance.
>> >> Full text is only used to highlight in my application(But it's very
>> >> time consuming, My avg query time is about 250ms, I guess it will cost
>> >> about 50ms if I just get top 10 full text. Things get worse when get
>> >> more full text because in disk, it scatters erverywhere for a query.).
>> >> My full text per machine is about 200GB. The memory available for
>> >> store full text is about 10GB. So I want to compress it in memory.
>> >> Suppose compression ratio is 1:5, then I can load 1/4 full text in
>> >> memory. I need a Cache component for it. Has anyone faced the problem
>> >> before? I need some advice. Is it possbile using external tools such
>> >> as MemCached? Thank you.
>> >>
>> >
>>
>


about warm up

2010-07-14 Thread Li Li
I want to load the full text into an external cache, so I added some code in
newSearcher(), where the warm-up takes place. I add my code before Solr's own
warm-up, which is configured in solrconfig.xml like this:

  <listener event="newSearcher" class="...">
  ...
  </listener>


public void newSearcher(SolrIndexSearcher newSearcher,
                        SolrIndexSearcher currentSearcher) {
    warmTextCache(newSearcher, warmTextCache, new String[]{"title","content"});

    for (NamedList nlst : (List)args.get("queries")) {
        ...
    }
}

In warmTextCache I need a reader to get some docs in a loop like
    for(int i=0;i<...;i++){ ... }
but the warming thread blocks inside SolrCore.getSearcher() at

      if (onDeckSearchers>0 && !forceNew && _searcher==null) {
        try {
Line 1000   searcherLock.wait();
        } catch (InterruptedException e) {
          log.info(SolrException.toStr(e));
        }
      }

and only about 5 minutes later does it return.

So how can I get a "safe" reader in this situation?


Re: Solr Statistics, num docs

2010-07-15 Thread Li Li
numDocs is the total number of indexed (searchable) docs. Maybe your docs have
duplicated keys: when a duplicate is added, the older document is deleted. The
uniqueKey is defined in schema.xml.

2010/7/16 Karthik K :
> Hi,
> Is numDocs in solr statistics equal to the total number of documents that
> are searchable on solr? I find that this number is very low in my case
> compared to the total number of documents indexed. Please let me know the
> possible reasons for this.
>
> Thanks,
> Karthik
>


Re: Ranking based on term position

2010-07-19 Thread Li Li
I have considered this problem and tried to solve it using two methods.
With these methods we can also boost a doc by the relative positions of
the query terms.

1: add term positions when indexing
   modify TermScorer.score():

  public float score() {
    assert doc != -1;
    int f = freqs[pointer];
    float raw =                                  // compute tf(f)*weight
        f < SCORE_CACHE_SIZE                     // check cache
        ? scoreCache[f]                          // cache hit
        : getSimilarity().tf(f) * weightValue;   // cache miss
    // modified by LiLi
    try {
      int[] positions = this.getPositions(f);
      float positionBoost = 1.0f;
      for (int pos : positions) {
        positionBoost *= this.getPositionBoost(pos);
      }
      raw *= positionBoost;
    } catch (IOException e) {
    }
    // modified
    return norms == null ? raw : raw * SIM_NORM_DECODER[norms[doc] & 0xFF]; // normalize for field
  }


  private int[] getPositions(int f) throws IOException {
    termPositions.skipTo(doc);
    int[] positions = new int[f];
    int docId = termPositions.doc();
    assert docId == doc;
    int tf = termPositions.freq();
    assert tf == f;
    for (int i = 0; i < f; i++) {
      positions[i] = termPositions.nextPosition();  // collect every position of the term in this doc
    }
    return positions;
  }

2010/7/19 :
> I need to make sure that documents with the search term occurring
> towards the beginning of the document are ranked higher.
>
> For example,
>
> Search term : ox
> Doc 1: box fox ox
> Doc 2: ox box fox
>
> Result: Doc2 will be ranked higher than Doc1.
>
> The solution I can think of is sorting by term position (after enabling
> term vectors). Is that the best way to go about it ?
>
> Thanks
> Papiya
>
>
> Pink OTC Markets Inc. provides the leading inter-dealer quotation and
> trading system in the over-the-counter (OTC) securities market.   We create
> innovative technology and data solutions to efficiently connect market
> participants, improve price discovery, increase issuer disclosure, and
> better inform investors.   Our marketplace, comprised of the issuer-listed
> OTCQX and broker-quoted   Pink Sheets, is the third largest U.S. equity
> trading venue for company shares.
>
> This document contains confidential information of Pink OTC Markets and is
> only intended for the recipient.   Do not copy, reproduce (electronically or
> otherwise), or disclose without the prior written consent of Pink OTC
> Markets.      If you receive this message in error, please destroy all
> copies in your possession (electronically or otherwise) and contact the
> sender above.
>


a bug of solr distributed search

2010-07-21 Thread Li Li
In QueryComponent.mergeIds, a document whose uniqueKey duplicates that of
another document is removed. In the current implementation, the first one
encountered is kept:
      String prevShard = uniqueDoc.put(id, srsp.getShard());
      if (prevShard != null) {
        // duplicate detected
        numFound--;
        collapseList.remove(id+"");
        docs.set(i, null);  // remove it.
        // For now, just always use the first encountered since we can't currently
        // remove the previous one added to the priority queue.  If we switched
        // to the Java5 PriorityQueue, this would be easier.
        continue;
        // make which duplicate is used deterministic based on shard
        // if (prevShard.compareTo(srsp.shard) >= 0) {
        //   TODO: remove previous from priority queue
        //   continue;
        // }
      }

 It iterates over the shard responses with
    for (ShardResponse srsp : sreq.responses)
but the order of sreq.responses can differ between requests; that is, shard1's
result and shard2's result may swap positions. So when a uniqueKey (such as a
url) occurs in both shard1 and shard2, which copy gets used is unpredictable,
and the scores of the two docs differ because of different idf.
So the same query can return different results.
One possible solution is to sort the ShardResponse list by shard name, as
sketched below.
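
A minimal sketch of that workaround, only illustrative and written against the Solr 1.4 classes mentioned above: sort a copy of sreq.responses by shard name before mergeIds iterates over it, so the surviving duplicate no longer depends on response arrival order.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.handler.component.ShardResponse;

public class ShardResponseOrder {
  /** Returns the responses ordered by shard name so that duplicate elimination
   *  in mergeIds always sees the shards in the same order. */
  public static List<ShardResponse> sortByShard(List<ShardResponse> responses) {
    List<ShardResponse> sorted = new ArrayList<ShardResponse>(responses);
    Collections.sort(sorted, new Comparator<ShardResponse>() {
      public int compare(ShardResponse a, ShardResponse b) {
        return a.getShard().compareTo(b.getShard());
      }
    });
    return sorted;
  }
}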


Re: a bug of solr distributed search

2010-07-21 Thread Li Li
But users will think something is wrong when they run the same query and get
different results.

2010/7/21 MitchK :
>
> Li Li,
>
> this is the intended behaviour, not a bug.
> Otherwise you could get back the same record in a response for several
> times, which may not be intended by the user.
>
> Kind regards,
> - Mitch
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983675.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: a bug of solr distributed search

2010-07-21 Thread Li Li
Yes. This will make users think our search engine has a bug.
From the comments in the code, more work is needed:
  if (prevShard != null) {
// For now, just always use the first encountered since we
can't currently
// remove the previous one added to the priority queue.
If we switched
// to the Java5 PriorityQueue, this would be easier.
continue;
// make which duplicate is used deterministic based on shard
// if (prevShard.compareTo(srsp.shard) >= 0) {
//  TODO: remove previous from priority queue
//  continue;
// }
  }

2010/7/21 MitchK :
>
> Ah, okay. I understand your problem. Why should doc x be at position 1 when
> searching for the first time, and when I search for the 2nd time it occurs
> at position 8 - right?
>
> I am not sure, but I think you can't prevent this without custom coding or
> making a document's occurence unique.
>
> Kind regards,
> - Mitch
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p983771.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: a bug of solr distributed search

2010-07-21 Thread Li Li
I think what Siva means is that when there are docs with the same url, we
should keep the doc whose score is larger.
That is the right solution.
But it shows a problem of distributed search without a common idf: a doc
will get different scores in different shards.
2010/7/22 MitchK :
>
> It already was sorted by score.
>
> The problem here is the following:
> Shard_A and shard_B contain doc_X and doc_X.
> If you are querying for something, doc_X could have a score of 1.0 at
> shard_A and a score of 12.0 at shard_B.
>
> You can never be sure which doc Solr sees first. In the bad case, Solr sees
> the doc_X firstly at shard_A and ignores it at shard_B. That means, that the
> doc maybe would occur at page 10 in pagination, although it *should* occur
> at page 1 or 2.
>
> Kind regards,
> - Mitch
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/a-bug-of-solr-distributed-search-tp983533p984743.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Which is a good XPath generator?

2010-07-24 Thread Li Li
This is not really a Solr topic. Maybe you should read some papers about
wrapper generation or automatic web data extraction. If you want to generate
XPath expressions, you could read Bing Liu's papers, such as "Structured Data
Extraction from the Web based on Partial Tree Alignment". Besides the DOM
tree, visual clues may also be used. But none of these is a perfect solution,
because of the diversity of web pages.

2010/7/25 Savannah Beckett :
> Hi,
>   I am looking for a XPath generator that can generate xpath by picking a
> specific tag inside a html.  Do you know a good xpath generator?  If possible,
> free xpath generator would be great.
> Thanks.
>
>
>


Re: a bug of solr distributed search

2010-07-25 Thread Li Li
where is the link of this patch?

2010/7/24 Yonik Seeley :
> On Fri, Jul 23, 2010 at 2:23 PM, MitchK  wrote:
>> why do we do not send the output of TermsComponent of every node in the
>> cluster to a Hadoop instance?
>> Since TermsComponent does the map-part of the map-reduce concept, Hadoop
>> only needs to reduce the stuff. Maybe we even do not need Hadoop for this.
>> After reducing, every node in the cluster gets the current values to compute
>> the idf.
>> We can store this information in a HashMap-based SolrCache (or something
>> like that) to provide constant-time access. To keep the values up to date,
>> we can repeat that after every x minutes.
>
> There's already a patch in JIRA that does distributed IDF.
> Hadoop wouldn't be the right tool for that anyway... it's for batch
> oriented systems, not low-latency queries.
>
>> If we got that, it does not care whereas we use doc_X from shard_A or
>> shard_B, since they will all have got the same scores.
>
> That only works if the docs are exactly the same - they may not be.
>
> -Yonik
> http://www.lucidimagination.com
>


Re: a bug of solr distributed search

2010-07-25 Thread Li Li
the solr version I used is 1.4

2010/7/26 Li Li :
> where is the link of this patch?
>
> 2010/7/24 Yonik Seeley :
>> On Fri, Jul 23, 2010 at 2:23 PM, MitchK  wrote:
>>> why do we do not send the output of TermsComponent of every node in the
>>> cluster to a Hadoop instance?
>>> Since TermsComponent does the map-part of the map-reduce concept, Hadoop
>>> only needs to reduce the stuff. Maybe we even do not need Hadoop for this.
>>> After reducing, every node in the cluster gets the current values to compute
>>> the idf.
>>> We can store this information in a HashMap-based SolrCache (or something
>>> like that) to provide constant-time access. To keep the values up to date,
>>> we can repeat that after every x minutes.
>>
>> There's already a patch in JIRA that does distributed IDF.
>> Hadoop wouldn't be the right tool for that anyway... it's for batch
>> oriented systems, not low-latency queries.
>>
>>> If we got that, it does not care whereas we use doc_X from shard_A or
>>> shard_B, since they will all have got the same scores.
>>
>> That only works if the docs are exactly the same - they may not be.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>


Re: Problem with parsing date

2010-07-26 Thread Li Li
I use a format like yyyy-MM-dd'T'hh:mm:ss'Z' and it works.

2010/7/26 Rafal Bluszcz Zawadzki :
> Hi,
>
> I am using Data Import Handler from Solr 1.4.
>
> Parts of my data-config.xml are:
>
>
>         <entity name="..."
>                 processor="XPathEntityProcessor"
>                 stream="false"
>                 forEach="/multistatus/response"
>                 url="/tmp/file.xml"
>                 transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer"
>                 >
> .
>
>             <field column="..." xpath="/multistatus/response/propstat/prop/getlastmodified"
>                    dateTimeFormat="EEE, d MMM yyyy HH:mm:ss z" />
>             <field column="..." xpath="/multistatus/response/propstat/prop/creationdate"
>                    dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'"/>
>
> During full-import I got message:
>
> WARNING: Error creating document :
> SolrInputDocument[{SearchableText=SearchableText(1.0)={phrase},
> parentPaths=parentPaths(1.0)={/site},
> review_state=review_state(1.0)={published}, created=created(1.0)={Sat Oct 11
> 14:38:27 CEST 2003}, UID=UID(1.0)={http://www.example.com:80/File-1563},
> Title=Title(1.0)={This is only an example document},
> portal_type=portal_type(1.0)={Document}, modified=modified(1.0)={Wed, 15 Jul
> 2009 08:23:34 GMT}}]
> org.apache.solr.common.SolrException: Invalid Date String:'Wed, 15 Jul 2009
> 08:23:34 GMT'
> at org.apache.solr.schema.DateField.parseMath(DateField.java:163)
> at org.apache.solr.schema.TrieDateField.createField(TrieDateField.java:171)
> at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
> at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:246)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
>
> Which as I understand, means that Solr / Java coudnt parse my date.
>
> In my xml file it looks like:
> Wed, 15 Jul 2009 08:23:34 GMT
>
> In my opinion the format "EEE, d MMM yyyy HH:mm:ss z" is correct, and what is more
> important - it was supposed to work with the same data a week ago :)
>
> Any idea will be appreciate.
>
> --
> Rafal Zawadzki
> Backend developer
>


Is there a cache for a query?

2010-07-26 Thread Li Li
I want a cache that caches the complete result of a query (all steps,
including collapsing, highlighting and faceting). I read
http://wiki.apache.org/solr/SolrCaching, but can't find such a global cache.
Maybe I can use an external key-value cache. Is there anything like this in
Solr?


Re: Speed up Solr Index merging

2010-07-29 Thread Li Li
I faced this problem but couldn't find a good solution. If you have a large
stored field, such as the full text of the documents, not storing it in Lucene
will make merging quicker, because merging two indexes forces all the fdt files
to be copied into a new one. If you store it externally, the problem you face
is how to manage it; you probably need a uniqueKey such as a url to store the
key-value pairs somewhere.
2010/7/29 Karthik K :
> I need to merge multiple solr indexes into one big index. The process is
> very slow. Please share any tips to speed it up. Will optimizing the indexes
> before merging help?
>
> Thanks,
> Karthik
>


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Li Li
Highlighting time is mainly spent on fetching the field you want to highlight
and tokenizing that field (if you don't store term vectors). You can check
which of these is the problem.

2010/7/30 Peter Spam :
> If I don't do highlighting, it's really fast.  Optimize has no effect.
>
> -Peter
>
> On Jul 29, 2010, at 11:54 AM, dc tech wrote:
>
>> Are you storing the entire log file text in SOLR? That's almost 3gb of
>> text that you are storing in the SOLR. Try to
>> 1) Is this first time performance or on repaat queries with the same fields?
>> 2) Optimze the index and test performance again
>> 3) index without storing the text and see what the performance looks like.
>>
>>
>> On 7/29/10, Peter Spam  wrote:
>>> Any ideas?  I've got 5000 documents with an average size of 850k each, and
>>> it sometimes takes 2 minutes for a query to come back when highlighting is
>>> turned on!  Help!
>>>
>>>
>>> -Pete
>>>
>>> On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:
>>>
 From the mailing list archive, Koji wrote:

> 1. Provide another field for highlighting and use copyField to copy
> plainText to the highlighting field.

 and Lance wrote:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

> If you want to highlight field X, doing the
> termOffsets/termPositions/termVectors will make highlighting that field
> faster. You should make a separate field and apply these options to that
> field.
>
> Now: doing a copyfield adds a "value" to a multiValued field. For a text
> field, you get a multi-valued text field. You should only copy one value
> to the highlighted field, so just copyField the document to your special
> field. To enforce this, I would add multiValued="false" to that field,
> just to avoid mistakes.
>
> So, all_text should be indexed without the term* attributes, and should
> not be stored. Then your document stored in a separate field that you use
> for highlighting and has the term* attributes.

 I've been experimenting with this, and here's what I've tried:

   <field name="..." ... multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
   <field name="..." ... multiValued="true" />
   ...

 ... but it's still very slow (10+ seconds).  Why is it better to have two
 fields (one indexed but not stored, and the other not indexed but stored)
 rather than just one field that's both indexed and stored?


 From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors

> If you aren't always using all the stored fields, then enabling lazy
> field loading can be a huge boon, especially if compressed fields are
> used.

 What does this mean?  How do you load a field lazily?

 Thanks for your time, guys - this has started to become frustrating, since
 it works so well, but is very slow!


 -Pete

 On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:

> Data set: About 4,000 log files (will eventually grow to millions).
> Average log file is 850k.  Largest log file (so far) is about 70MB.
>
> Problem: When I search for common terms, the query time goes from under
> 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
> disable highlighting, performance improves a lot, but is still slow for
> some queries (7 seconds).  Thanks in advance for any ideas!
>
>
> -Peter
>
>
> -
>
> 4GB RAM server
> % java -Xms2048M -Xmx3072M -jar start.jar
>
> -
>
> schema.xml changes:
>
>  <fieldType name="..." ...>
>    <analyzer>
>      <tokenizer class="..."/>
>      <filter class="solr.WordDelimiterFilterFactory" ...
>              generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>              catenateAll="0" splitOnCaseChange="0"/>
>    </analyzer>
>  </fieldType>
>
> ...
>
>  <field name="..." ... multiValued="false" termVectors="true" termPositions="true"
>         termOffsets="true" />
>  <field name="..." ... default="NOW" multiValued="false"/>
>  <field name="..." ... multiValued="false"/>
>  <field name="..." ... multiValued="false"/>
>  <field name="..." ... multiValued="false"/>
>  <field name="..." ... multiValued="false"/>
>  <field name="..." ... multiValued="false"/>
>  <field name="..." ... multiValued="false"/>
>
> ...
>
> <defaultSearchField>
> body
> </defaultSearchField>
>
> -
>
> solrconfig.xml changes:
>
>  2147483647
>  128
>
> -
>
> The query:
>
> rowStr = "&rows=10"
> facet =
> "&facet=true&facet.limit=10&facet.field=device&facet

how to ignore position in indexing?

2010-07-31 Thread Li Li
hi all
 In Lucene, I want to store only the tf in a term's inverted list. In my
application I only provide dismax queries with boolean queries and don't
support queries that need position info, such as phrase queries. So I don't
want to store position info in the prx file. How can I turn it off? And if I
turn it off, will search become quicker?


a small problem of distributed search

2010-08-16 Thread Li Li
  The current implementation of distributed search uses the unique key in the
STAGE_EXECUTE_QUERY stage.

  public int distributedProcess(ResponseBuilder rb) throws IOException {
...
if (rb.stage == ResponseBuilder.STAGE_EXECUTE_QUERY) {
  createMainQuery(rb);
  return ResponseBuilder.STAGE_GET_FIELDS;
}
...
  }

  In createMainQuery:
    sreq.params.set(CommonParams.FL,
        rb.req.getSchema().getUniqueKeyField().getName() + ",score");
  which sets fl=url,score (url is my unique key, which is indexed without
being analyzed, and stored).
  The url is actually loaded in BinaryResponseWriter.writeDocList, which calls
    Document doc = searcher.doc(id, returnFields); // url is in returnFields

  So the urls of all top-N docs are read from the fdt file. But the unique key
is usually short and can be loaded into memory, so we can use a StringIndex
(FieldCache) to cache it.
  In my application we need the top 100 docs for collapsing and reranking.
Caching the key saves more than 50ms per query (we use SCSI disks) and bad
worst cases become less frequent. A sketch of the idea follows.
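
A minimal sketch of the caching step, assuming Lucene's FieldCache is acceptable and the unique key field is named url (indexed, untokenized, as described above); the array is built once per reader, and later lookups by internal doc id avoid the fdt file entirely.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class UniqueKeyCache {
  /** Loads every indexed value of the "url" field into memory for this reader. */
  public static String[] load(IndexReader reader) throws IOException {
    return FieldCache.DEFAULT.getStrings(reader, "url");
  }

  /** Replaces searcher.doc(id, returnFields) for the unique-key-only case. */
  public static String uniqueKeyOf(String[] keys, int docId) {
    return keys[docId];
  }
}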


Re: Color search for images

2010-09-16 Thread Li Li
Do you mean content-based image retrieval, or just searching images by tags?
If the former, you can try LIRE.

2010/9/15 Shawn Heisey :
>  My index consists of metadata for a collection of 45 million objects, most
> of which are digital images.  The executives have fallen in love with
> Google's color image search.  Here's a search for "flower" with a red color
> filter:
>
> http://www.google.com/images?q=flower&tbs=isch:1,ic:specific,isc:red
>
> I am interested in duplicating this.  Can this group of fine people point me
> in the right direction?  I don't want anyone to do it for me, just help me
> find software and/or algorithms that can extract the color information, then
> find a way to get Solr to index and search it.
>
> Thanks,
> Shawn
>
>


Re: Can Solr do approximate matching?

2010-09-22 Thread Li Li
It seems there is a MoreLikeThis in Lucene; I don't know whether there is a
counterpart in Solr. It simply uses the found document as a query to find
similar documents. Alternatively, you can use a boolean OR query, and similar
questions will get higher scores. Of course, you can also analyse the question
with some NLP techniques, such as identifying entities and ignoring less
useful words such as "which" and "is", but I guess a tf*idf scoring
function will also work well.
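
A hedged pointer: Solr does ship a MoreLikeThis component/handler built on that Lucene class, so a request along these lines (field name and counts are only placeholders) should return similar documents for each hit:

    http://localhost:8983/solr/select?q=id:12345&mlt=true&mlt.fl=question_text&mlt.count=10

mlt.fl should point at a stored field (or one with term vectors) so the component has text to work with.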

2010/9/22 Igor Chudov :
> Hi guys. I am new here. So if I am unwittingly violating any rules,
> let me know.
>
> I am working with Solr because I own algebra.com, where I have a
> database of 250,000 or so answered math questions. I want to use Solr
> to provide approximate matching functionality called "similar items".
> So that users looking at a problem could see how similar ones were
> answered.
>
> And my question is, does Solr support some "find similar"
> functionality. For example, in my mind, sentence "I like tasty
> strawberries" is 'similar' to a sentence such as "I like yummy
> strawberries", just because both have a few of the same words.
>
> So, to end my long winded query, how would I implement a "find top ten
> similar items to this one" functionality?
>
> Thanks!
>


is multi-threads searcher feasible idea to speed up?

2010-09-28 Thread Li Li
hi all
    I want to speed up search time for my application. In a query, the time is
largely spent reading the postings lists (I/O with the frq files) and computing
scores and collecting results (CPU, with a priority queue). The I/O is hard to
optimize further, or is already partly optimized by NIO, so I want to use
multiple threads to make use of the CPUs. Of course this may decrease QPS, but
the response time will also decrease, which is what I want, because CPU is
easier to obtain than faster hard disks.
    I read the search code roughly and found it is not an easy task to modify
the search process, so I want to use an easier method.
    One option is Solr distributed search, dispatching documents to many
shards; but because of the network overhead and the global idf problem it does
not seem like a good method for me.
    Another is to modify the index structure and split the frq files evenly.
    e.g.  term1 -> doc1,doc2,doc3,doc4,doc5 in _1.frq
    I create 2 indexes with
            term1 -> doc1,doc3,doc5
            term1 -> doc2,doc4
    When searching, I create 2 threads with 2 priority queues to collect the
top N docs and merge their results (a rough sketch is below).
    Is the 2nd idea feasible? Does anyone have related ideas? Thanks.
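
A rough sketch of the second idea under stated assumptions: the two sub-indexes are opened as separate IndexSearchers, searched in parallel, and the per-index top-N hits are merged by score. It deliberately ignores the global idf problem mentioned above, and the doc ids in the merged list remain local to their own sub-index.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class ParallelSearch {
  /** Searches every sub-index in its own thread and merges the top hits by score. */
  public static List<ScoreDoc> search(final Query query, List<IndexSearcher> searchers,
                                      final int topN) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(searchers.size());
    List<Future<TopDocs>> futures = new ArrayList<Future<TopDocs>>();
    for (final IndexSearcher s : searchers) {
      futures.add(pool.submit(new Callable<TopDocs>() {
        public TopDocs call() throws Exception {
          return s.search(query, topN);   // one priority queue per thread
        }
      }));
    }
    List<ScoreDoc> merged = new ArrayList<ScoreDoc>();
    for (Future<TopDocs> f : futures) {
      Collections.addAll(merged, f.get().scoreDocs);
    }
    pool.shutdown();
    Collections.sort(merged, new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        return Float.compare(b.score, a.score);   // descending by score
      }
    });
    return merged.size() > topN ? merged.subList(0, topN) : merged;
  }
}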


Re: is multi-threads searcher feasible idea to speed up?

2010-09-28 Thread Li Li
Yes, there is a MultiSearcher in Lucene, but the idf across the 2 indexes is
not global. Maybe I can modify it, and also the index, like:
term1  df=5 doc1 doc3 doc5
term1  df=5 doc2 doc4

2010/9/28 Li Li :
> hi all
>    I want to speed up search time for my application. In a query, the
> time is largly used in reading postlist(io with frq files) and
> calculate scores and collect result(cpu, with Priority Queue). IO is
> hardly optimized or already part optimized by nio. So I want to use
> multithreads to utilize cpu. of course, it may be decrease QPS, but
> the response time will also decrease-- that what I want. Because cpu
> is easily obtained compared to faster hard disk.
>    I read the codes of searching roughly and find it's not an easy
> task to modify search process. So I want to use other easy method .
>    One is use solr distributed search and dispatch documents to many
> shards. but due to the network and global idf problem,it seems not a
> good method for me.
>    Another one is to modify the index structure and averagely
> dispatch frq files.
>    e.g    term1 -> doc1,doc2, doc3,doc4,doc5 in _1.frq
>    I create to 2 indexes with
>            term1->doc1,doc3,doc5
>            term1->doc2,doc4
>    when searching, I create 2 threads with 2 PriorityQueues to
> collect top N docs and merging their results
>    Is the 2nd idea feasible? Or any one has related idea? thanks.
>


question about SolrCore

2010-10-11 Thread Li Li
hi all,
    I want to know the details of how IndexReaders are used in SolrCore. I read
a little of the SolrCore code; here is my understanding, is it correct?
    Each SolrCore has many SolrIndexSearchers and keeps them in _searchers, and
_searcher keeps track of the latest version of the index. Each
SolrIndexSearcher has a SolrIndexReader. If there isn't any update, all these
searchers share one single SolrIndexReader. If there is an update, a new
searcher is created, with a new SolrIndexReader associated with it.
    I did a simple test. A thread does a query and is blocked at a breakpoint.
Then I feed some data to update the index. On commit, a newSearcher is created.
    Here is the debug info:

    SolrCore _searcher [SolrIndexSearcher@...ab]
    _searchers [SolrIndexSearcher@...77, SolrIndexSearcher@...ab, SolrIndexSearcher@...f8]

    SolrIndexSearcher@...77's SolrIndexReader is the old one, and ...ab and
...f8 share the same newest SolrIndexReader.
    When the query finishes, SolrIndexSearcher@...77 is discarded. When the
newSearcher succeeds in warming up, there is only one SolrIndexSearcher. The
SolrIndexReader for the old version of the index is discarded and only the
segments in the newest SolrIndexReader are referenced; the segments that are
not in the new version can then be deleted, because no file pointers reference
them.
    Then I start 3 queries. There is still only one SolrIndexSearcher, but
refCount=4. It seems many searches can share one single SolrIndexSearcher.
    So in which situation will there exist more than one SolrIndexSearcher
sharing just one SolrIndexReader?
    Another question: for each version of the index, is there just one
SolrIndexReader instance associated with it? Can it happen that more than one
SolrIndexReader is open for the same version of the index?


Re: How to manage different indexes for different users

2010-10-11 Thread Li Li
Will one user ever search another user's index?
If not, you can use multiple cores (one per user).

2010/10/11 Tharindu Mathew :
> Hi everyone,
>
> I'm using solr to integrate search into my web app.
>
> I have a bunch of users who would have to be given their own individual
> indexes.
>
> I'm wondering whether I'd have to append their user ID as I index a file.
> I'm not sure which approach to follow. Is there a sample or a doc I can read
> to understand how to approach this problem?
>
> Thanks in advance.
>
> --
> Regards,
>
> Tharindu
>


Re: question about SolrCore

2010-10-28 Thread Li Li
Is there anyone who could help me?

2010/10/11 Li Li :
> hi all,
>    I want to know the detail of IndexReader in SolrCore. I read a
> little codes of SolrCore. Here is my understanding, are they correct?
>    Each SolrCore has many SolrIndexSearcher and keeps them in
> _searchers. and _searcher keep trace of the latest version of index.
> Each SolrIndexSearcher has a SolrIndexReader. If there isn't any
> update, all these searchers share one single SolrIndexReader. If there
> is an update, then a newSearcher will be created and a new
> SolrIndexReader associated with it.
>    I did a simple test.
>    A thread do a query and blocked by breakpoint. Then I feed some
> data to update index. When commit, a newSearcher is created.
>    Here is the debug info:
>
>    SolrCore _searcher [solrindexsearc...@...ab]
>
> _searchers[solrindexsearc...@...77,solrindexsearc...@...ab,solrindexsearc...@..f8]
>                 solrindexsearc...@...77 's SolrIndexReader is old one
> and     ab and f8 share the same newest SolrIndexReader
>    When query finished solrindexsearc...@...77 is discarded. When
> newSearcher success to warmup, There is only one SolrIndexSearcher.
>    The SolrIndexReader of old version of index is discarded and only
> segments in newest SolrIndexReader are referenced. Those segments not
> in new version can then be deleted because no file pointer reference
> them
> .
>    Then I start 3 queries. There is only one SolrIndexSearcher but RefCount=4.
>    It seems many search can share one single SolrIndexSearcher.
>    So in which situation, there will exist more than one
> SolrIndexSearcher that they share just one SolrIndexReader?
>    Another question, for each version of index, is there just one
> SolrIndexReader instance associated with it? will it occur that more
> than one SolrIndexReader are opened and they are the same version of
> index?
>


Re: Does Solr support Natural Language Search

2010-11-04 Thread Li Li
I don't think current Lucene offers what you want.
    There are two main tasks in a search process.
    One is "understanding" the user's intention. Because natural language
understanding is difficult, current information retrieval systems "force" users
to input some terms to express their needs. But terms are ambiguous; e.g.
"apple" may mean a fruit or an electronics company, so users are asked to input
more terms to disambiguate, e.g. "apple fruit" suggests the user wants the
fruit. There are many things that help detect the user's intent, such as query
expansion ("Searches related to" in Google) and suggestions while the user
types. The ultimate goal is to understand intention by analyzing the user's
natural language.

    The other is "understanding" documents. Current models such as VSM don't
understand the document; they just regard documents as collections of words.
When a user inputs a word, they return the documents containing this word (tf);
of course idf is also taken into consideration.
    But that is far from understanding. That's why keyword stuffing works:
because the machine doesn't really understand the document, it can't judge
whether the document is good or bad, or whether it matches the query well.
    So PageRank and other external signals are used to relieve this problem,
but they can't fully solve it.
    Fully understanding documents needs more advanced NLP techniques, but I
don't think they will reach human intelligence in the near future, although I
am an NLP person.
    Another road is humans helping the machine "understand": that's what is
called Web 2.0, social networks, the semantic web... but that isn't an easy
task either.


2010/11/4 jayant :
>
> Does Solr support Natural Language Search? I did not find any thing about
> this in the reference manual. Please let me know.
> Thanks.
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Does-Solr-support-Natural-Language-Search-tp1839262p1839262.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


more sql-like commands for solr

2012-02-07 Thread Li Li
hi all,
we have used solr to provide searching services in many products. I
found that for each product, we have to write some configurations and query
expressions.
our users are not used to this. they are familiar with sql and they may
describe their need like this: I want a query that can search books whose title
contains java, and I will group these books by publishing year and order by
matching score and freshness, where the weight of score is 2 and the weight of
freshness is 1.
maybe they will be happy if they can use sql-like statements to convey
their needs:
select * from books where title contains java group by pub_year order
by score^2, freshness^1
they may also like to be able to delete documents with: delete from
books where title contains java and pub_year between 2011 and 2012.
we can define a language similar to sql and translate it to a solr
query string such as .../select/?q=+title:java^2 +pub_year:2011
this may be equivalent to what apache hive is for hadoop.
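just to illustrate the idea, here is a toy sketch of such a translator. The class name, the single hard-coded pattern and the URL layout are all made up; a real version would need a proper parser and would also have to handle group by, order by and boosts.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy sketch: translate one fixed sql-like pattern into a solr query url.
// Only "select * from <core> where <field> contains <term>" is handled.
public class MiniSql2Solr {
    private static final Pattern P = Pattern.compile(
            "select \\* from (\\w+) where (\\w+) contains (\\w+)");

    public static String toSolrQuery(String sql) {
        Matcher m = P.matcher(sql.toLowerCase());
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported statement: " + sql);
        }
        String core = m.group(1);   // e.g. "books" -> which core to query
        String field = m.group(2);  // e.g. "title"
        String term = m.group(3);   // e.g. "java"
        return "/solr/" + core + "/select?q=%2B" + field + "%3A" + term;
    }

    public static void main(String[] args) {
        // prints: /solr/books/select?q=%2Btitle%3Ajava
        System.out.println(toSolrQuery("select * from books where title contains java"));
    }
}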


Re: Chinese Phonetic search

2012-02-07 Thread Li Li
you can convert Chinese words to pinyin and use n-grams to search for
phonetically similar words
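a minimal sketch of the n-gram part is below. It assumes the pinyin conversion itself (e.g. "中国" -> "zhong guo") has already been done by some separate library, and the gram size of 2 is just an example.

import java.util.ArrayList;
import java.util.List;

// Sketch: split a pinyin string into character bigrams; indexing/querying these
// in an n-gram field lets phonetically similar words share grams.
public class PinyinNGram {
    public static List<String> bigrams(String pinyin) {
        String s = pinyin.replace(" ", "");          // "zhong guo" -> "zhongguo"
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("zhong guo"));    // [zh, ho, on, ng, gg, gu, uo]
    }
}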

On Wed, Feb 8, 2012 at 11:10 AM, Floyd Wu  wrote:

> Hi there,
>
> Does anyone here ever implemented phonetic search especially with
> Chinese(traditional/simplified) using SOLR or Lucene?
>
> Please share some thought or point me a possible solution. (hint me search
> keywords)
>
> I've searched and read lot of related articles but have no luck.
>
> Many thanks.
>
> Floyd
>


Re: New segment file created too often

2012-02-13 Thread Li Li
 Commit is called
after adding each document


 you should add enough documents and then call commit once. commit is a
costly operation.
 if you want to see the latest added documents, you could use NRT
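for example, with SolrJ, something like this (the url, field names, batch size and record source are all placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: buffer documents and commit once per batch instead of after every add.
public class BatchIndexer {
    public static void index(Iterable<String[]> records) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> buffer = new ArrayList<SolrInputDocument>();
        for (String[] r : records) {          // r[0]=id, r[1]=text, for illustration
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", r[0]);
            doc.addField("text", r[1]);
            buffer.add(doc);
            if (buffer.size() >= 1000) {      // flush a batch
                server.add(buffer);
                server.commit();              // one commit per 1000 docs, not per doc
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            server.add(buffer);
            server.commit();
        }
    }
}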

On Tue, Feb 14, 2012 at 12:47 AM, Huy Le  wrote:

> Hi,
>
> I am using solr 3.5.  I seeing solr keeps creating new segment files (<1MB
> files) so often that it triggers segment merge about every one minute. I
> search the news archive, but could not find any info on this issue.  I am
> indexing about 10 docs of less 2KB each every second.  Commit is called
> after adding each document. Relevant config params are:
>
> 10
> 1024
> 2147483647
>
> What might be triggering this frequent new segment files creation?  Thanks!
>
> Huy
>
> --
> Huy Le
> Spring Partners, Inc.
> http://springpadit.com
>


Re: New segment file created too often

2012-02-13 Thread Li Li
as far as I know, there are three situations in which a new segment is
flushed: the RAM buffer for the posting data structure is used up; the number
of added docs exceeds a threshold; or there are many deletions in a segment.
but with your configuration it seems unlikely to flush many small segments.

1024
2147483647
On Tue, Feb 14, 2012 at 1:10 AM, Huy Le  wrote:

> Hi,
>
> I am using solr 3.5.  As I understood it, NRT is a solr 4 feature, but solr
> 4 is not released yet.
>
> I understand commit after adding each document is expensive, but the
> application requires that documents be available after adding to the index.
>
> What I don't understand is why new segment files are created so often.
> Are the commit calls triggering new segment files being created?  I don't
> see this behavior in another environment of the same version of solr.
>
> Huy
>
> On Mon, Feb 13, 2012 at 11:55 AM, Li Li  wrote:
>
> >  Commit is called
> > after adding each document
> >
> >
> >  you should add enough documents and then calling a commit. commit is a
> > cost operation.
> >  if you want to get latest feeded documents, you could use NRT
> >
> > On Tue, Feb 14, 2012 at 12:47 AM, Huy Le 
> wrote:
> >
> > > Hi,
> > >
> > > I am using solr 3.5.  I seeing solr keeps creating new segment files
> > (<1MB
> > > files) so often that it triggers segment merge about every one minute.
> I
> > > search the news archive, but could not find any info on this issue.  I
> am
> > > indexing about 10 docs of less 2KB each every second.  Commit is called
> > > after adding each document. Relevant config params are:
> > >
> > > 10
> > > 1024
> > > 2147483647
> > >
> > > What might be triggering this frequent new segment files creation?
> >  Thanks!
> > >
> > > Huy
> > >
> > > --
> > > Huy Le
> > > Spring Partners, Inc.
> > > http://springpadit.com
> > >
> >
>
>
>
> --
> Huy Le
> Spring Partners, Inc.
> http://springpadit.com
>


Re: New segment file created too often

2012-02-13 Thread Li Li
can you post your config file?
I found there are 2 places to configure ramBufferSizeMB in the latest svn of
3.6's example solrconfig.xml. try modifying them both:

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    ...
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    ...
  </mainIndex>

On Tue, Feb 14, 2012 at 1:10 AM, Huy Le  wrote:

> Hi,
>
> I am using solr 3.5.  As I understood it, NRT is a solr 4 feature, but solr
> 4 is not released yet.
>
> I understand commit after adding each document is expensive, but the
> application requires that documents be available after adding to the index.
>
> What I don't understand is why new segment files are created so often.
> Are the commit calls triggering new segment files being created?  I don't
> see this behavior in another environment of the same version of solr.
>
> Huy
>
> On Mon, Feb 13, 2012 at 11:55 AM, Li Li  wrote:
>
> >  Commit is called
> > after adding each document
> >
> >
> >  you should add enough documents and then calling a commit. commit is a
> > cost operation.
> >  if you want to get latest feeded documents, you could use NRT
> >
> > On Tue, Feb 14, 2012 at 12:47 AM, Huy Le 
> wrote:
> >
> > > Hi,
> > >
> > > I am using solr 3.5.  I seeing solr keeps creating new segment files
> > (<1MB
> > > files) so often that it triggers segment merge about every one minute.
> I
> > > search the news archive, but could not find any info on this issue.  I
> am
> > > indexing about 10 docs of less 2KB each every second.  Commit is called
> > > after adding each document. Relevant config params are:
> > >
> > > 10
> > > 1024
> > > 2147483647
> > >
> > > What might be triggering this frequent new segment files creation?
> >  Thanks!
> > >
> > > Huy
> > >
> > > --
> > > Huy Le
> > > Spring Partners, Inc.
> > > http://springpadit.com
> > >
> >
>
>
>
> --
> Huy Le
> Spring Partners, Inc.
> http://springpadit.com
>


Re: Can I rebuild an index and remove some fields?

2012-02-13 Thread Li Li
method 1, dumping data
for stored fields, you can traverse the whole index and save them
somewhere else.
for indexed but not stored fields, it may be more difficult.
if the indexed but not stored field is not analyzed (fields such as id),
it's easy to get from FieldCache.StringIndex.
But for analyzed fields, though theoretically they can be restored from
term vectors and term positions, it's hard to recover them from the index.

method 2, hack the index metadata
1. indexed fields
  delete by query, e.g. field:*
2. stored fields
   because all fields are stored sequentially, it's not easy to delete
some fields. leaving them will not affect search speed, but if you want to get
stored fields and the useless fields are very long, then it will slow
down retrieval.
   it's also possible to hack them, but it needs more effort to
understand the index file format and to traverse the fdt/fdx files.
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html

this will give you some insight.
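for method 1, a minimal Lucene 3.x sketch of the stored-field dump (the index path and field names are placeholders):

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Sketch: walk every live document and dump the stored fields you want to keep.
public class DumpStoredFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;          // skip deleted docs
                Document doc = reader.document(i);
                String id = doc.get("id");
                String title = doc.get("title");
                System.out.println(id + "\t" + title);      // or write to a file/DB
            }
        } finally {
            reader.close();
        }
    }
}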

On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart wrote:

> Lets say I have a large index (100M docs, 1TB, split up between 10
> indexes).  And a bunch of the "stored" and "indexed" fields are not used in
> search at all.  In order to save memory and disk, I'd like to rebuild that
> index *without* those fields, but I don't have original documents to
> rebuild entire index with (don't have the full-text anymore, etc.).  Is
> there some way to rebuild or optimize an existing index with only a sub-set
> of the existing indexed fields?  Or alternatively is there a way to avoid
> loading some indexed fields at all ( to avoid loading term infos and terms
> index ) ?
>
> Thanks
> Bob


Re: Can I rebuild an index and remove some fields?

2012-02-13 Thread Li Li
for method 2, the delete-by-query idea is wrong: we can't delete terms that way.
   you would also have to hack the tii and tis files.

On Tue, Feb 14, 2012 at 2:46 PM, Li Li  wrote:

> method1, dumping data
> for stored fields, you can traverse the whole index and save it to
> somewhere else.
> for indexed but not stored fields, it may be more difficult.
> if the indexed and not stored field is not analyzed(fields such as
> id), it's easy to get from FieldCache.StringIndex.
> But for analyzed fields, though theoretically it can be restored from
> term vector and term position, it's hard to recover from index.
>
> method 2, hack with metadata
> 1. indexed fields
>   delete by query, e.g. field:*
> 2. stored fields
>because all fields are stored sequentially. it's not easy to delete
> some fields. this will not affect search speed. but if you want to get
> stored fields,  and the useless fields are very long, then it will slow
> down.
>also it's possible to hack with it. but need more effort to
> understand the index file format  and traverse the fdt/fdx file.
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>
> this will give you some insight.
>
>
> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart wrote:
>
>> Lets say I have a large index (100M docs, 1TB, split up between 10
>> indexes).  And a bunch of the "stored" and "indexed" fields are not used in
>> search at all.  In order to save memory and disk, I'd like to rebuild that
>> index *without* those fields, but I don't have original documents to
>> rebuild entire index with (don't have the full-text anymore, etc.).  Is
>> there some way to rebuild or optimize an existing index with only a sub-set
>> of the existing indexed fields?  Or alternatively is there a way to avoid
>> loading some indexed fields at all ( to avoid loading term infos and terms
>> index ) ?
>>
>> Thanks
>> Bob
>
>
>


Re: Can I rebuild an index and remove some fields?

2012-02-14 Thread Li Li
I have roughly read the codes of 4.0 trunk. maybe it's feasible.
SegmentMerger.add(IndexReader) will add to be merged Readers
merge() will call
  mergeTerms(segmentWriteState);
  mergePerDoc(segmentWriteState);

   mergeTerms() will construct fields from IndexReaders
    for(int readerIndex=0;readerIndex<mergeState.readers.size();readerIndex++){
      final MergeState.IndexReaderAndLiveDocs r =
mergeState.readers.get(readerIndex);
      final Fields f = r.reader.fields();
      final int maxDoc = r.reader.maxDoc();
      if (f != null) {
        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
        fields.add(f);
      }
      docBase += maxDoc;
    }
    So If you wrapper your IndexReader and override its fields() method,
maybe it will work for merge terms.

    for DocValues, it can also override AtomicReader.docValues(). just
return null for fields you want to remove. maybe it should
traverse CompositeReader's getSequentialSubReaders() and wrapper each
AtomicReader

    other things like term vectors norms are similar.
On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart wrote:

> I was thinking if I make a wrapper class that aggregates another
> IndexReader and filter out terms I don't want anymore it might work.   And
> then pass that wrapper into SegmentMerger.  I think if I filter out terms
> on GetFieldNames(...) and Terms(...) it might work.
>
> Something like:
>
> HashSet ignoredTerms=...;
>
> FilteringIndexReader wrapper=new FilterIndexReader(reader);
>
> SegmentMerger merger=new SegmentMerger(writer);
>
> merger.add(wrapper);
>
> merger.Merge();
>
>
>
>
>
> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>
> > for method 2, delete is wrong. we can't delete terms.
> >   you also should hack with the tii and tis file.
> >
> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li  wrote:
> >
> >> method1, dumping data
> >> for stored fields, you can traverse the whole index and save it to
> >> somewhere else.
> >> for indexed but not stored fields, it may be more difficult.
> >>if the indexed and not stored field is not analyzed(fields such as
> >> id), it's easy to get from FieldCache.StringIndex.
> >>But for analyzed fields, though theoretically it can be restored from
> >> term vector and term position, it's hard to recover from index.
> >>
> >> method 2, hack with metadata
> >> 1. indexed fields
> >>  delete by query, e.g. field:*
> >> 2. stored fields
> >>   because all fields are stored sequentially. it's not easy to
> delete
> >> some fields. this will not affect search speed. but if you want to get
> >> stored fields,  and the useless fields are very long, then it will slow
> >> down.
> >>   also it's possible to hack with it. but need more effort to
> >> understand the index file format  and traverse the fdt/fdx file.
> >>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
> >>
> >> this will give you some insight.
> >>
> >>
> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart  >wrote:
> >>
> >>> Lets say I have a large index (100M docs, 1TB, split up between 10
> >>> indexes).  And a bunch of the "stored" and "indexed" fields are not
> used in
> >>> search at all.  In order to save memory and disk, I'd like to rebuild
> that
> >>> index *without* those fields, but I don't have original documents to
> >>> rebuild entire index with (don't have the full-text anymore, etc.).  Is
> >>> there some way to rebuild or optimize an existing index with only a
> sub-set
> >>> of the existing indexed fields?  Or alternatively is there a way to
> avoid
> >>> loading some indexed fields at all ( to avoid loading term infos and
> terms
> >>> index ) ?
> >>>
> >>> Thanks
> >>> Bob
> >>
> >>
> >>
>
>


Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Li Li
great. I think you could make it a public tool. maybe others also need such
functionality.
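for anyone trying the same in Java, here is a rough Lucene 3.x sketch of such a field-removing reader wrapper. The class name is made up, and it only covers term enumeration -- stored fields, term vectors and norms would need similar treatment, as the original poster did.

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Rough sketch: wrap an IndexReader so that terms of unwanted fields are not
// enumerated when the wrapped reader is fed to a merger.
public class FieldRemovingReader extends FilterIndexReader {
    private final Set<String> removedFields;

    public FieldRemovingReader(IndexReader in, Set<String> removedFields) {
        super(in);
        this.removedFields = removedFields;
    }

    @Override
    public Collection<String> getFieldNames(FieldOption option) {
        Collection<String> names = new HashSet<String>(super.getFieldNames(option));
        names.removeAll(removedFields);
        return names;
    }

    @Override
    public TermEnum terms() throws IOException {
        return new FilterTermEnum(super.terms()) {
            @Override
            public boolean next() throws IOException {
                while (in.next()) {                      // advance the wrapped enum
                    Term t = in.term();
                    if (t == null || !removedFields.contains(t.field())) {
                        return true;                     // keep terms of other fields
                    }
                }
                return false;
            }
        };
    }
}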

On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart wrote:

> I implemented an index shrinker and it works.  I reduced my test index
> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
> need anymore.  I'm actually using Lucene.Net for this project so code
> is C# using Lucene.Net 2.9.2 API.  But basic idea is:
>
> Create an IndexReader wrapper that only enumerates the terms you want
> to keep, and that removes terms from documents when returning
> documents.
>
> Use the SegmentMerger to re-write each segment (where each segment is
> wrapped by the wrapper class), writing new segment to a new directory.
> Collect the SegmentInfos and do a commit in order to create a new
> segments file in new index directory
>
> Done - you now have a shrunk index with specified terms removed.
>
> Implementation uses separate thread for each segment, so it re-writes
> them in parallel.  Took about 15 minutes to do 770,000 doc index on my
> macbook.
>
>
> On Tue, Feb 14, 2012 at 10:12 PM, Li Li  wrote:
> > I have roughly read the codes of 4.0 trunk. maybe it's feasible.
> >SegmentMerger.add(IndexReader) will add to be merged Readers
> >merge() will call
> >  mergeTerms(segmentWriteState);
> >  mergePerDoc(segmentWriteState);
> >
> >   mergeTerms() will construct fields from IndexReaders
> >for(int
> > readerIndex=0;readerIndex >  final MergeState.IndexReaderAndLiveDocs r =
> > mergeState.readers.get(readerIndex);
> >  final Fields f = r.reader.fields();
> >  final int maxDoc = r.reader.maxDoc();
> >  if (f != null) {
> >slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
> >fields.add(f);
> >  }
> >  docBase += maxDoc;
> >}
> >So If you wrapper your IndexReader and override its fields() method,
> > maybe it will work for merge terms.
> >
> >for DocValues, it can also override AtomicReader.docValues(). just
> > return null for fields you want to remove. maybe it should
> > traverse CompositeReader's getSequentialSubReaders() and wrapper each
> > AtomicReader
> >
> >other things like term vectors norms are similar.
> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart  >wrote:
> >
> >> I was thinking if I make a wrapper class that aggregates another
> >> IndexReader and filter out terms I don't want anymore it might work.
> And
> >> then pass that wrapper into SegmentMerger.  I think if I filter out
> terms
> >> on GetFieldNames(...) and Terms(...) it might work.
> >>
> >> Something like:
> >>
> >> HashSet ignoredTerms=...;
> >>
> >> FilteringIndexReader wrapper=new FilterIndexReader(reader);
> >>
> >> SegmentMerger merger=new SegmentMerger(writer);
> >>
> >> merger.add(wrapper);
> >>
> >> merger.Merge();
> >>
> >>
> >>
> >>
> >>
> >> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
> >>
> >> > for method 2, delete is wrong. we can't delete terms.
> >> >   you also should hack with the tii and tis file.
> >> >
> >> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li  wrote:
> >> >
> >> >> method1, dumping data
> >> >> for stored fields, you can traverse the whole index and save it to
> >> >> somewhere else.
> >> >> for indexed but not stored fields, it may be more difficult.
> >> >>if the indexed and not stored field is not analyzed(fields such as
> >> >> id), it's easy to get from FieldCache.StringIndex.
> >> >>But for analyzed fields, though theoretically it can be restored
> from
> >> >> term vector and term position, it's hard to recover from index.
> >> >>
> >> >> method 2, hack with metadata
> >> >> 1. indexed fields
> >> >>  delete by query, e.g. field:*
> >> >> 2. stored fields
> >> >>   because all fields are stored sequentially. it's not easy to
> >> delete
> >> >> some fields. this will not affect search speed. but if you want to
> get
> >> >> stored fields,  and the useless fields are very long, then it will
> slow
> >> >> down.
> >> >>   also it's possible to hack with it. but need more effort to
> >> >> understand the index file format  and traverse the fdt/fd

Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Li Li
you can fool the lucene scoring function: override each factor such as idf,
queryNorm and lengthNorm and let them simply return 1.0f.
I don't know whether lucene 4 will expose more details, but for 2.x/3.x, lucene
can only score by the vector space model and the formula can't be replaced by users.
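a sketch of what such a "flattened" Similarity could look like for Lucene/Solr 3.x follows. The class name is made up and it is wired in via a <similarity> element in schema.xml; assuming no extra query or index-time boosts, a plain boolean OR query then scores roughly the number of matching clauses.

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

// Sketch: neutralize every scoring factor except the per-clause match itself.
public class MatchCountSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;     // any number of occurrences counts once
    }

    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f;                       // ignore how rare the term is
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1.0f;
    }

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1.0f;                       // don't rescale by the match ratio
    }

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return 1.0f;                       // ignore field length and index-time boosts
    }
}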

On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark  wrote:

> Hi,
>
> I'm looking for a way to sort results by the number of matching terms.
> Being able to sort by the coord() value or by the overlap value that gets
> passed into the coord() function would do the trick. Is there a way I can
> expose those values to the sort function?
>
> I'd appreciate any help that points me in the right direction. I'm OK with
> making basic code modifications.
>
> Thanks!
>
> -Nick
>


Re: Fw:how to make fdx file

2012-03-04 Thread Li Li
lucene will never modify old segment files; it just flushes into a new
segment or merges old segments into a new one. after merging, old segments
will be deleted.
once a file (such as fdt or fdx) is generated, it will never be
re-generated. the only possibilities are that something went wrong in the
generating stage, or that it was deleted by another program or wrongly
deleted by a human.

On Sat, Mar 3, 2012 at 2:33 PM, C.Yunqin <345804...@qq.com> wrote:

> yes,the fdt file still is there.  can i make new fdx file through fdt file.
>  is there a posibilty that  during the process of updating and optimizing,
> the index will be deleted then re-generated?
>
>
>
>  -- Original --
>  From:  "Erick Erickson";
>  Date:  Sat, Mar 3, 2012 08:28 AM
>  To:  "solr-user";
>
>  Subject:  Re: Fw:how to make fdx file
>
>
> As far as I know, fdx files don't just disappear, so I can only assume
> that something external removed it.
>
> That said, if you somehow re-indexed and had no fields where
> stored="true", then the fdx file may not be there.
>
> Are you seeing problems as a result? This file is used to store
> index information for stored fields. Do you have an fdt file?
>
> Best
> Erick
>
> On Fri, Mar 2, 2012 at 2:48 AM, C.Yunqin <345804...@qq.com> wrote:
> > Hi ,
> >   my fdx file was unexpected gone, then the solr sever stop running;
> what I can do to recover solr?
> >
> >  Other files still exist.
> >
> >  Thanks very much
> >
> >
> > 
>


Re: How to limit the number of open searchers?

2012-03-06 Thread Li Li
what do you mean by "programmatically"? modify solr's code? because solr is
not like lucene: it only provides http interfaces for its users rather than
a java api.

if you want to modify solr, you can find the code in SolrCore:
private final LinkedList<RefCounted<SolrIndexSearcher>> _searchers = new
LinkedList<RefCounted<SolrIndexSearcher>>();
and _searcher is the current searcher.
be careful to use searcherLock to synchronize your code.
maybe you can write your code like:

synchronized (searcherLock) {
    if (_searchers.size() == 1) {
        ...
    }
}




On Tue, Mar 6, 2012 at 3:18 AM, Michael Ryan  wrote:

> Is there a way to limit the number of searchers that can be open at a
> given time?  I know there is a maxWarmingSearchers configuration that
> limits the number of warming searchers, but that's not quite what I'm
> looking for...
>
> Ideally, when I commit, I want there to only be one searcher open before
> the commit, so that during the commit and warming, there is a max of two
> searchers open.  I'd be okay with delaying the commit until there is only
> one searcher open.  Is there a way to programmatically determine how many
> searchers are currently open?
>
> -Michael
>


Re: index size with replication

2012-03-13 Thread Li Li
 optimize will generate new segments and delete old ones. if your master
also provides searching service during indexing, the old files may be
opened by old SolrIndexSearcher. they will be deleted later. So when
indexing, the index size may double. But a moment later, old indexes will
be deleted.

On Wed, Mar 14, 2012 at 7:06 AM, Mike Austin  wrote:

> I have a master with two slaves.  For some reason on the master if I do an
> optimize after indexing on the master it double in size from 42meg to 90
> meg.. however,  when the slaves replicate they get the 42meg index..
>
> Should the master and slaves always be the same size?
>
> Thanks,
> Mike
>


Re: Sorting on non-stored field

2012-03-14 Thread Li Li
it should be indexed but not analyzed. it doesn't need to be stored.
reading field values from stored fields is extremely slow,
so lucene uses FieldCache.StringIndex to read field values for sorting. so if you want to
sort by some field, you should index this field and not analyze it.

On Wed, Mar 14, 2012 at 6:43 PM, Finotti Simone  wrote:

> I was wondering: is it possible to sort a Solr result-set on a non-stored
> value?
>
> Thank you


Re: How to avoid the unexpected character error?

2012-03-14 Thread Li Li
There is a class org.apache.solr.common.util.XML in solr
you can use this wrapper:
public static String escapeXml(String s) throws IOException{
StringWriter sw=new StringWriter();
XML.escapeCharData(s, sw);
return sw.getBuffer().toString();
}

On Wed, Mar 14, 2012 at 4:34 PM, neosky  wrote:

> I use the xml to index the data. One filed might contains some characters
> like '' <=>
> It seems that will produce the error
> I modify that filed doesn't index, but it doesn't work. I need to store the
> filed, but index might not be indexed.
> Thanks!
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-avoid-the-unexpected-character-error-tp3824726p3824726.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to avoid the unexpected character error?

2012-03-14 Thread Li Li
no, it has nothing to do with schema.xml.
post.jar just posts a file, it doesn't parse it.
solr will use an xml parser to parse this file. if you don't escape special
characters, it's not a valid xml file and solr will throw exceptions.

On Thu, Mar 15, 2012 at 12:33 AM, neosky  wrote:

> Thanks!
> Does the schema.xml support this parameter? I am using the example post.jar
> to index my file.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-avoid-the-unexpected-character-error-tp3824726p3825959.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr out of memory exception

2012-03-14 Thread Li Li
how much memory is allocated to the JVM?

On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar  wrote:

> Solr is giving out of memory exception. Full Indexing was completed fine.
> Later while searching maybe when it tries to load the results in memory it
> starts giving this exception. Though with the same memory allocated to
> Tomcat and exactly same solr replica on another server it is working
> perfectly fine. I am working on 64 bit software's including Java & Tomcat
> on Windows.
> Any help would be appreciated.
>
> Here are the logs:
>
> The server encountered an internal error (Severe errors in solr
> configuration. Check your log files for more detailed information on what
> may be wrong. If you want solr to continue after configuration errors,
> change: false in
> null -
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at
> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
> org.apache.solr.core.SolrCore.(SolrCore.java:579) at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
> at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115)
> at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
> at
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
> at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
> at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
> at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601) at
> org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943) at
> org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778) at
> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504) at
> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317) at
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
> at
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065) at
> org.apache.catalina.core.StandardHost.start(StandardHost.java:840) at
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057) at
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463) at
> org.apache.catalina.core.StandardService.start(StandardService.java:525) at
> org.apache.catalina.core.StandardServer.start(StandardServer.java:754) at
> org.apache.catalina.startup.Catalina.start(Catalina.java:595) at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
> java.lang.reflect.Method.invoke(Unknown Source) at
> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
> org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by:
> java.lang.OutOfMemoryError: Java heap space at
> org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
> at org.apache.lucene.index.TermInfosReader.(TermInfosReader.java:91)
> at
> org.apache.lucene.index.SegmentReader$CoreReaders.(SegmentReader.java:122)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:652) at
> org.apache.lucene.index.SegmentReader.get(SegmentReader.java:613) at
> org.apache.lucene.index.DirectoryReader.(DirectoryReader.java:104) at
> org.apache.lucene.index.ReadOnlyDirectoryReader.(ReadOnlyDirectoryReader.java:27)
> at
> org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:74)
> at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69) at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:476) at
> org.apache.lucene.index.IndexReader.open(IndexReader.java:403) at
> org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057) at
> org.apache.solr.core.SolrCore.(SolrCore.java:579) at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
> at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
> at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
> at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
> at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115)
> at
> org.apache.catalina.core.StandardContext.fil

Re: Solr out of memory exception

2012-03-14 Thread Li Li
it seems you are using a 64-bit jvm (a 32-bit jvm can only allocate about 1.5GB).
you should enable pointer compression with -XX:+UseCompressedOops

On Thu, Mar 15, 2012 at 1:58 PM, Husain, Yavar  wrote:

> Thanks for helping me out.
>
> I have allocated Xms-2.0GB Xmx-2.0GB
>
> However i see Tomcat is still using pretty less memory and not 2.0G
>
> Total Memory on my Windows Machine = 4GB.
>
> With smaller index size it is working perfectly fine. I was thinking of
> increasing the system RAM & tomcat heap space allocated but then how come
> on a different server with exactly same system and solr configuration &
> memory it is working fine?
>
>
> -Original Message-
> From: Li Li [mailto:fancye...@gmail.com]
> Sent: Thursday, March 15, 2012 11:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr out of memory exception
>
> how many memory are allocated to JVM?
>
> On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar 
> wrote:
>
> > Solr is giving out of memory exception. Full Indexing was completed fine.
> > Later while searching maybe when it tries to load the results in memory
> it
> > starts giving this exception. Though with the same memory allocated to
> > Tomcat and exactly same solr replica on another server it is working
> > perfectly fine. I am working on 64 bit software's including Java & Tomcat
> > on Windows.
> > Any help would be appreciated.
> >
> > Here are the logs:
> >
> > The server encountered an internal error (Severe errors in solr
> > configuration. Check your log files for more detailed information on what
> > may be wrong. If you want solr to continue after configuration errors,
> > change: false in
> > null -
> > java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
> at
> > org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
> > org.apache.solr.core.SolrCore.(SolrCore.java:579) at
> >
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
> > at
> >
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
> > at
> >
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
> > at
> >
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115)
> > at
> >
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
> > at
> > org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
> > at
> >
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
> > at
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
> > at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
> at
> > org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943) at
> > org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778) at
> > org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504) at
> > org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317) at
> >
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
> > at
> >
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
> > at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
> at
> > org.apache.catalina.core.StandardHost.start(StandardHost.java:840) at
> > org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057) at
> > org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463) at
> > org.apache.catalina.core.StandardService.start(StandardService.java:525)
> at
> > org.apache.catalina.core.StandardServer.start(StandardServer.java:754) at
> > org.apache.catalina.startup.Catalina.start(Catalina.java:595) at
> > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> > sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
> > java.lang.reflect.Method.invoke(Unknown Source) at
> > org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
> > org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by:
> > java.lang.OutOfMemoryError: Java heap space at
> >
> org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
> > at
> org.apache.lucene.index.TermInfosReader.(TermInfosReader.java:91)
&

Re: Solr out of memory exception

2012-03-15 Thread Li Li
it can reduce memory usage. for a small-heap application (less than 4GB), it
may speed things up.
but be careful: for a large-heap application, it depends.
you should do some tests yourself.
our application's test result is: it reduced memory usage but increased
response time. we use a 25GB heap.

http://lists.apple.com/archives/java-dev/2010/Apr/msg00157.html

Dyer, James james.d...@ingrambook.com via lucene.apache.org
3/18/11

to solr-user
Our tests showed, in our situation, the "compressed oops" flag caused our
minor (ParNew) generation time to decrease significantly.   We're using a
larger heap (22gb) and our index size is somewhere in the 40's gb total.  I
guess with any of these jvm parameters, it all depends on your situation
and you need to test.  In our case, this flag solved a real problem we were
having.  Whoever wrote the JRocket book you refer to no doubt had other
scenarios in mind...

On Thu, Mar 15, 2012 at 3:02 PM, C.Yunqin <345804...@qq.com> wrote:

> why should enable pointer compression?
>
>
>
>
> -- Original --
> From:  "Li Li";
> Date:  Thu, Mar 15, 2012 02:41 PM
> To:  "Husain, Yavar";
> Cc:  "solr-user@lucene.apache.org";
> Subject:  Re: Solr out of memory exception
>
>
> it seems you are using 64bit jvm(32bit jvm can only allocate about 1.5GB).
> you should enable pointer compression by -XX:+UseCompressedOops
>
> On Thu, Mar 15, 2012 at 1:58 PM, Husain, Yavar 
> wrote:
>
> > Thanks for helping me out.
> >
> > I have allocated Xms-2.0GB Xmx-2.0GB
> >
> > However i see Tomcat is still using pretty less memory and not 2.0G
> >
> > Total Memory on my Windows Machine = 4GB.
> >
> > With smaller index size it is working perfectly fine. I was thinking of
> > increasing the system RAM & tomcat heap space allocated but then how come
> > on a different server with exactly same system and solr configuration &
> > memory it is working fine?
> >
> >
> > -Original Message-
> > From: Li Li [mailto:fancye...@gmail.com]
> > Sent: Thursday, March 15, 2012 11:11 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr out of memory exception
> >
> > how many memory are allocated to JVM?
> >
> > On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar 
> > wrote:
> >
> > > Solr is giving out of memory exception. Full Indexing was completed
> fine.
> > > Later while searching maybe when it tries to load the results in memory
> > it
> > > starts giving this exception. Though with the same memory allocated to
> > > Tomcat and exactly same solr replica on another server it is working
> > > perfectly fine. I am working on 64 bit software's including Java &
> Tomcat
> > > on Windows.
> > > Any help would be appreciated.
> > >
> > > Here are the logs:
> > >
> > > The server encountered an internal error (Severe errors in solr
> > > configuration. Check your log files for more detailed information on
> what
> > > may be wrong. If you want solr to continue after configuration errors,
> > > change: false in
> > > null -
> > > java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
> > at
> > > org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
> > > org.apache.solr.core.SolrCore.(SolrCore.java:579) at
> > >
> >
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
> > > at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
> > > at
> > >
> >
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
> > > at
> > >
> >
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
> > > at
> > >
> >
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115)
> > > at
> > >
> >
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
> > > at
> > >
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
> > > at
> > >
> >
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
> > > at
> > org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
> > > at
> org.apache.catalina.c

Re: How to avoid the unexpected character error?

2012-03-15 Thread Li Li
it's not the right place.
when you use java -Durl=http://... -jar post.jar data.xml
the data.xml file must be a valid xml file. you should escape special chars
in this file.
I don't know how you generate this file.
if you use a java program (or other scripts) to generate this file, you should
use xml tools to generate it.
but if you generate it by string concatenation like this (the tag and field
names are just an example):
StringBuilder buf=new StringBuilder();
buf.append("<add><doc>");
buf.append("<field name=\"content\">");
buf.append("text content");
you should escape special chars.
if you use java, you can make use of org.apache.solr.common.util.XML class
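for example, a minimal sketch (the field name "content" is only illustrative):

import java.io.StringWriter;
import org.apache.solr.common.util.XML;

// Sketch: build update xml with escaped character data.
public class BuildUpdateXml {
    public static String docXml(String rawText) throws Exception {
        StringWriter sw = new StringWriter();
        sw.write("<add><doc><field name=\"content\">");
        XML.escapeCharData(rawText, sw);   // turns & < > etc. into entities
        sw.write("</field></doc></add>");
        return sw.toString();
    }
}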

On Fri, Mar 16, 2012 at 2:03 PM, neosky  wrote:

> I am sorry, but I can't get what you mean.
> I tried the  HTMLStripCharFilter and PatternReplaceCharFilter. It doesn't
> work.
> Could you give me an example? Thanks!
>
>   positionIncrementGap="100">
>   
> 
> 
>   
>  
>
> I also tried:
>
>  replacement=""
> maxBlockChars="1" blockDelimiters="|"/>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-avoid-the-unexpected-character-error-tp3824726p3831064.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Trouble Setting Up Development Environment

2012-03-23 Thread Li Li
here is my method.
1. check out the latest source code from trunk or download the tar ball
   svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk

2. create a dynamic web project in eclipse and close it.
   for example, I create a project name lucene-solr-trunk in my
workspace.

3. copy/mv the source code to this project(it's not necessary)
   here is my directory structure
   lili@lili-desktop:~/workspace/lucene-solr-trunk$ ls
bin.tests-framework  build  lucene_trunk  src  testindex  WebContent
  lucene_trunk is the top directory checked out from svn in step 1.
4. remove the WebContent directory generated by eclipse and replace it with a
soft link to lucene_trunk/solr/webapp/web/ :
  lili@lili-desktop:~/workspace/lucene-solr-trunk$ ll WebContent
lrwxrwxrwx 1 lili lili 28 2011-08-18 18:50 WebContent ->
lucene_trunk/solr/webapp/web/
5. open lucene_trunk/dev-tools/eclipse/dot.classpath. copy all lines like
kind="src" to a temp file



6. replace all string like path="xxx" to path="lucene_trunk/xxx" and copy
them into .classpath file
7. mkdir WebContent/WEB-INF/lib
8. extract all jar file in dot.classpath to WebContent/WEB-INF/lib
I use this command:
lili@lili-desktop:~/workspace/lucene-solr-trunk/lucene_trunk$ cat
dev-tools/eclipse/dot.classpath |grep "kind=\"lib"|awk -F "path=\"" '{print
$2}' |awk -F "\"/>" '{print $1}' |xargs cp ../WebContent/WEB-INF/lib/
9. open this project and refresh it.
if everything is ok, it will compile all java files successfully. if
there is something wrong, probably we are not using the correct jars, because
there are many versions of the same library.
10. right click the project -> debug As -> debug on Server
it will fail because no solr home is specified.
11. right click the project -> debug As -> debug Configuration -> Arguments
Tab -> VM arguments
 add
-Dsolr.solr.home=/home/lili/workspace/lucene-solr-trunk/lucene_trunk/solr/example/solr
 you can also add other vm arguments like -Xmx1g here.
12. all fine, add a break point at SolrDispatchFilter.doFilter(). all solr
request comes here
13. have fun~


On Fri, Mar 23, 2012 at 11:49 AM, Karthick Duraisamy Soundararaj <
karthick.soundara...@gmail.com> wrote:

> Hi Solr Ppl,
>I have been trying to set up solr dev env. I downloaded the
> tar ball of eclipse and the solr 3.5 source. Here are the exact sequence of
> steps I followed
>
> I extracted the solr 3.5 source and eclipse.
> I installed run-jetty-run plugin for eclipse.
> I ran ant eclipse in the solr 3.5 source directory
> I used eclipse's "Open existing project" option to open up the files in
> solr 3.5 directory. I got a huge tree in the name of lucene_solr.
>
> I run it and there is a SEVERE error: System property not set excetption. *
> solr*.test.sys.*prop1* not set and then the jetty loads solr. I then try
> localhost:8080/solr/select/ I get null pointer execpiton. I am only able to
> access admin page.
>
> Is there anything else I need to do?
>
> I tried to follow
>
> http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
> .
> But I dont find the solr-3.5.war file. I tried ant dist to generate the
> dist folder but that has many jars and wars..
>
> I am able to compile the source with ant compile, get the solr in example
> directory up and running.
>
> Will be great if someone can help me with this.
>
> Thanks,
> Karthick
>


Re: Trouble Setting Up Development Environment

2012-03-24 Thread Li Li
le/lib/jsp-2.1/jsp-api-2.1-glassfish-2.1.v20091210.jar
>> will not be exported or published. Runtime ClassNotFoundExceptions may
>> result.  solr3_5P/solr3_5Classpath Dependency Validator
>> Message
>> Classpath entry
>> /solr3_5/ssrc/solr/example/lib/servlet-api-2.5-20081211.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/apache-solr-noggit-r1099557.jar
>> will not be exported or published. Runtime ClassNotFoundExceptions may
>> result.  solr3_5P/solr3_5Classpath Dependency Validator
>> Message
>> Classpath entry /solr3_5/ssrc/solr/lib/commons-codec-1.5.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry
>> /solr3_5/ssrc/solr/lib/commons-csv-1.0-SNAPSHOT-r966014.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/commons-fileupload-1.2.1.jar will
>> not be exported or published. Runtime ClassNotFoundExceptions may result.
>> solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/commons-httpclient-3.1.jar will
>> not be exported or published. Runtime ClassNotFoundExceptions may result.
>> solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/commons-io-1.4.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/commons-lang-2.4.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/easymock-2.2.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry
>> /solr3_5/ssrc/solr/lib/geronimo-stax-api_1.0_spec-1.0.1.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/guava-r05.jar will not be exported
>> or published. Runtime ClassNotFoundExceptions may result.  solr3_5
>>  P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/jcl-over-slf4j-1.6.1.jar will not
>> be exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/junit-4.7.jar will not be exported
>> or published. Runtime ClassNotFoundExceptions may result.  solr3_5
>>  P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/servlet-api-2.4.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/slf4j-api-1.6.1.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/slf4j-jdk14-1.6.1.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>> Classpath entry /solr3_5/ssrc/solr/lib/wstx-asl-3.2.7.jar will not be
>> exported or published. Runtime ClassNotFoundExceptions may result.
>>  solr3_5P/solr3_5Classpath Dependency Validator Message
>>
>
>
> On Fri, Mar 23, 2012 at 3:25 AM, Li Li  wrote:
>
>> here is my method.
>> 1. check out latest source codes from trunk or download tar ball
>>svn checkout
>> http://svn.apache.org/repos/asf/lucene/dev/trunklucene_trunk
>>
>> 2. create a dynamic web project in eclipse and close it.
>>   for example, I create a project name lucene-solr-trunk in my
>> workspace.
>>
>> 3. copy/mv the source code to this project(it's not necessary)
>>   here is my directory structure
>>   lili@lili-desktop:~/workspace/lucene-solr-trunk$ ls
>> bin.tests-framework  build  lucene

Re: using solr to do a 'match'

2012-04-10 Thread Li Li
it's not possible now because lucene doesn't support this.
when doing a disjunction query, it only records how many terms match the
document.
I think this is a common requirement for many users.
I suggest lucene should split the scorer into a matcher and a scorer.
the matcher just returns which docs are matched and why/how they are matched.
especially for disjunction queries, it should tell which terms match and
possibly other
information such as tf/idf and the distance between terms (to support proximity
search).
That's the matcher's job. then the scorer (a ranking algorithm) uses a
flexible algorithm
to score the document and the collector can collect it.

On Wed, Apr 11, 2012 at 10:28 AM, Chris Book  wrote:

> Hello, I have a solr index running that is working very well as a search.
>  But I want to add the ability (if possible) to use it to do matching.  The
> problem is that by default it is only looking for all the input terms to be
> present, and it doesn't give me any indication as to how many terms in the
> target field were not specified by the input.
>
> For example, if I'm trying to match to the song title "dust in the wind",
> I'm correctly getting a match if the input query is "dust in wind".  But I
> don't want to get a match if the input is just "dust".  Although as a
> search "dust" should return this result, I'm looking for some way to filter
> this out based on some indication that the input isn't close enough to the
> output.  Perhaps if I could get information that that the number of input
> terms is much less than the number of terms in the field.  Or something
> else along those line?
>
> I realize that this isn't the typical use case for a search, but I'm just
> looking for some suggestions as to how I could improve the above example a
> bit.
>
> Thanks,
> Chris
>


Re: using solr to do a 'match'

2012-04-11 Thread Li Li
I searched my mail but found nothing.
the thread found by the keywords "boolean expression" is "Indexing Boolean
Expressions" from joaquin.delgado.
to tell which terms are matched, for BooleanScorer2 a simple method is to
modify DisjunctionSumScorer and add a BitSet to record the matched sub-scorers.
When the collector collects a document, it can get the scorer and recursively
find the matched terms.
But I think maybe it's better to add a component, maybe named matcher, that
does the matching job, while the scorer uses the information from the matcher
and does the ranking.

On Wed, Apr 11, 2012 at 4:32 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hi,
>
> This use case is similar to matching boolean expression problem. You can
> find recent thread about it. I have an idea that we can introduce
> disjunction query with dynamic mm (minShouldMatch parameter
>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)
> )
> i.e. 'match these clauses disjunctively but for every document use
> value
> from field cache of field xxxCount as a minShouldMatch parameter'. Also
> norms can be used as a source for dynamics mm values.
>
> Wdyt?
>
> On Wed, Apr 11, 2012 at 10:08 AM, Li Li  wrote:
>
> > it's not possible now because lucene don't support this.
> > when doing disjunction query, it only record how many terms match this
> > document.
> > I think this is a common requirement for many users.
> > I suggest lucene should divide scorer to a matcher and a scorer.
> > the matcher just return which doc is matched and why/how the doc is
> > matched.
> > especially for disjuction query, it should tell which term matches and
> > possible other
> > information such as tf/idf and the distance of terms(to support proximity
> > search).
> > That's the matcher's job. and then the scorer(a ranking algorithm) use
> > flexible algorithm
> > to score this document and the collector can collect it.
> >
> > On Wed, Apr 11, 2012 at 10:28 AM, Chris Book 
> wrote:
> >
> > > Hello, I have a solr index running that is working very well as a
> search.
> > >  But I want to add the ability (if possible) to use it to do matching.
> >  The
> > > problem is that by default it is only looking for all the input terms
> to
> > be
> > > present, and it doesn't give me any indication as to how many terms in
> > the
> > > target field were not specified by the input.
> > >
> > > For example, if I'm trying to match to the song title "dust in the
> wind",
> > > I'm correctly getting a match if the input query is "dust in wind".
>  But
> > I
> > > don't want to get a match if the input is just "dust".  Although as a
> > > search "dust" should return this result, I'm looking for some way to
> > filter
> > > this out based on some indication that the input isn't close enough to
> > the
> > > output.  Perhaps if I could get information that that the number of
> input
> > > terms is much less than the number of terms in the field.  Or something
> > > else along those line?
> > >
> > > I realize that this isn't the typical use case for a search, but I'm
> just
> > > looking for some suggestions as to how I could improve the above
> example
> > a
> > > bit.
> > >
> > > Thanks,
> > > Chris
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> ge...@yandex.ru
>
> <http://www.griddynamics.com>
>  
>


Re: Solr Scoring

2012-04-13 Thread Li Li
another way is to use payloads: http://wiki.apache.org/solr/Payloads
the advantage of payloads is that you only need one field and the frq
file stays smaller than with two fields. but the disadvantage is that payloads
are stored in the prx file, so I am not sure which one is faster. maybe you can
try them both.

On Fri, Apr 13, 2012 at 8:04 AM, Erick Erickson wrote:

> GAH! I had my head in "make this happen in one field" when I wrote my
> response, without being explicit. Of course Walter's solution is pretty
> much the standard way to deal with this.
>
> Best
> Erick
>
> On Thu, Apr 12, 2012 at 5:38 PM, Walter Underwood 
> wrote:
> > It is easy. Create two fields, text_exact and text_stem. Don't use the
> stemmer in the first chain, do use the stemmer in the second. Give the
> text_exact a bigger weight than text_stem.
> >
> > wunder
> >
> > On Apr 12, 2012, at 4:34 PM, Erick Erickson wrote:
> >
> >> No, I don't think there's an OOB way to make this happen. It's
> >> a recurring theme, "make exact matches score higher than
> >> stemmed matches".
> >>
> >> Best
> >> Erick
> >>
> >> On Thu, Apr 12, 2012 at 5:18 AM, Kissue Kissue 
> wrote:
> >>> Hi,
> >>>
> >>> I have a field in my index called itemDesc which i am applying
> >>> EnglishMinimalStemFilterFactory to. So if i index a value to this field
> >>> containing "Edges", the EnglishMinimalStemFilterFactory applies
> stemming
> >>> and "Edges" becomes "Edge". Now when i search for "Edges", documents
> with
> >>> "Edge" score better than documents with the actual search word -
> "Edges".
> >>> Is there a way i can make documents with the actual search word in this
> >>> case "Edges" score better than document with "Edge"?
> >>>
> >>> I am using Solr 3.5. My field definition is shown below:
> >>>
> >>>  positionIncrementGap="100">
> >>>  
> >>>
> >>>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >>>  >>>ignoreCase="true"
> >>>words="stopwords_en.txt"
> >>>enablePositionIncrements="true"
> >>> 
> >>>
> >>>
> >>>  
> >>>  
> >>>
> >>> synonyms="synonyms.txt"
> >>> ignoreCase="true" expand="true"/>
> >>> >>>ignoreCase="true"
> >>>words="stopwords_en.txt"
> >>>enablePositionIncrements="true"
> >>>/>
> >>>
> >>>
> >>> >>> protected="protwords.txt"/>
> >>>
> >>>  
> >>>
> >>>
> >>> Thanks.
> >
> >
> >
> >
> >
>


Re: How to read SOLR cache statistics?

2012-04-13 Thread Li Li
http://wiki.apache.org/solr/SolrCaching

On Fri, Apr 13, 2012 at 2:30 PM, Kashif Khan  wrote:

> Does anyone explain what does the following parameters mean in SOLR cache
> statistics?
>
> *name*:  queryResultCache
> *class*:  org.apache.solr.search.LRUCache
> *version*:  1.0
> *description*:  LRU Cache(maxSize=512, initialSize=512)
> *stats*:  lookups : 98
> *hits *: 59
> *hitratio *: 0.60
> *inserts *: 41
> *evictions *: 0
> *size *: 41
> *warmupTime *: 0
> *cumulative_lookups *: 98
> *cumulative_hits *: 59
> *cumulative_hitratio *: 0.60
> *cumulative_inserts *: 39
> *cumulative_evictions *: 0
>
> AND also this
>
>
> *name*:  fieldValueCache
> *class*:  org.apache.solr.search.FastLRUCache
> *version*:  1.0
> *description*:  Concurrent LRU Cache(maxSize=1, initialSize=10,
> minSize=9000, acceptableSize=9500, cleanupThread=false)
> *stats*:  *lookups *: 8
> *hits *: 4
> *hitratio *: 0.50
> *inserts *: 2
> *evictions *: 0
> *size *: 2
> *warmupTime *: 0
> *cumulative_lookups *: 8
> *cumulative_hits *: 4
> *cumulative_hitratio *: 0.50
> *cumulative_inserts *: 2
> *cumulative_evictions *: 0
> *item_ABC *:
>
> {field=ABC,memSize=340592,tindexSize=1192,time=1360,phase1=1344,nTerms=7373,bigTerms=1,termInstances=11513,uses=4}
> *item_BCD *:
>
> {field=BCD,memSize=341248,tindexSize=1952,time=1688,phase1=1688,nTerms=8075,bigTerms=0,termInstances=13510,uses=2}
>
> Without understanding these terms i cannot configure server for better
> cache
> usage. The point is searches are very slow. These stats were taken when
> server was down and restarted. I just want to understand what these terms
> mean actually
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-read-SOLR-cache-statistics-tp3907294p3907294.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


question about NRT(soft commit) and Transaction Log in trunk

2012-04-28 Thread Li Li
hi
    I checked out the trunk and played with its new soft commit
feature. it's cool. But I've got a few questions about it.
    From reading some introductory articles and the wiki, plus some hasty
code reading, my understanding of its implementation is:
    For a normal commit (hard commit), we flush everything to disk and
commit it. the flush is not very time consuming because of the
os level cache; the most time consuming part is the sync in the commit process.
    A soft commit just flushes postings and pending deletions to disk
and generates new segments. Then solr can use a
new searcher to read the latest index, warm up and then register itself.
    if there is no hard commit and the jvm crashes, the new data may be lost.
    if my understanding is correct, then why do we need the transaction log?
    I found that in DirectUpdateHandler2, every time a command is executed,
TransactionLog records a line in the log. But the default
sync level in RunUpdateProcessorFactory is flush, which means it will
not sync the log file. does this make sense?
    in database implementations, we usually write the log and modify data
in memory because the log is smaller than the real data. if it crashes,
we can redo the unfinished log and make the data correct. will Solr
use this log like that? if so, why isn't it synced?

Re: get latest 50 documents the fastest way

2012-05-01 Thread Li Li
You could reverse your sort order. Maybe you can override the tf
method of Similarity and return -1.0f * tf() (I don't know whether the
default collector allows scores smaller than zero).
Or you can hack this by adding a large constant, or write your own
collector; in its collect(int doc) method you can do something like this:
collect(int doc) {
    float score = scorer.score();
    score *= -1.0f;
    ...
}
If you don't need to sort by relevance score at all, just set a Sort.
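
To make the Sort option concrete, here is a minimal sketch against the
Lucene 3.x API; the index path, query and "id" field are placeholders, and
it assumes documents were added in roughly chronological order so that
reverse index order approximates "latest first":

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    IndexSearcher searcher = new IndexSearcher(
            IndexReader.open(FSDirectory.open(new File("/path/to/index")), true));
    Query query = new MatchAllDocsQuery();           // replace with the real query
    // sort by internal doc id, newest first; no relevance scoring is needed
    Sort newestFirst = new Sort(new SortField(null, SortField.DOC, true));
    TopDocs top = searcher.search(query, null, 50, newestFirst);
    for (ScoreDoc sd : top.scoreDocs) {
        System.out.println(searcher.doc(sd.doc).get("id"));  // "id" is a placeholder field
    }
    searcher.close();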

On Tue, May 1, 2012 at 10:38 PM, Yuval Dotan  wrote:
> Hi Guys
> We have a use case where we need to get the 50 *latest *documents that
> match my query - without additional ranking,sorting,etc on the results.
> My index contains 1,000,000,000 documents and i noticed that if the number
> of found documents is very big (larger than 50% of the index size -
> 500,000,000 docs) than it takes more than 5 seconds to get the results even
> with rows=50 parameter.
> Is there a way to get the results faster?
> Thanks
> Yuval


Re: Sorting result first which come first in sentance

2012-05-03 Thread Li Li
For versions below 4.0 this is not really possible because of Lucene's
scoring model. Position information is stored, but it is only used to
support phrase queries: it tells us whether a document matches, but we
cannot boost a document based on it. A similar problem is how to implement
proximity boosting: for two search terms we need to return all documents
that contain both terms, but if they occur as a phrase we give the
document the largest boost, if there is one word between them a smaller
boost, and if there are two words between them a smaller score still.
All such ranking algorithms need a more flexible scoring model.
I don't know whether the latest trunk takes this into consideration.
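
This is not the flexible per-position scoring described above, but a rough
approximation that is sometimes used with the existing query types: require
both terms and add a sloppy PhraseQuery as an optional boost clause. The
field, terms, slop and boost values below are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.TermQuery;

    // require both terms ...
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("title", "bomb")), Occur.MUST);
    q.add(new TermQuery(new Term("title", "blast")), Occur.MUST);
    // ... and add an optional sloppy phrase so close-together matches score higher
    PhraseQuery near = new PhraseQuery();
    near.add(new Term("title", "bomb"));
    near.add(new Term("title", "blast"));
    near.setSlop(3);      // "close enough" window, arbitrary
    near.setBoost(5.0f);  // boost value, arbitrary
    q.add(near, Occur.SHOULD);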

On Fri, May 4, 2012 at 3:43 AM, Jonty Rhods  wrote:
>> Hi all,
>>
>>
>>
>> I need suggetion:
>>
>>
>>
>> I
>>
>> Hi all,
>>
>>
>>
>> I need suggetion:
>>
>>
>>
>> I have many title like:
>>
>>
>>
>> 1 bomb blast in kabul
>>
>> 2 kabul bomb blast
>>
>> 3 3 people killed in serial bomb blast in kabul
>>
>>
>>
>> I want 2nd result should come first while user search by "kabul".
>>
>> Because kabul is on 1st postion in that sentance.  Similarly 1st result
>> should come on 2nd and 3rd should come last.
>>
>>
>>
>> Please suggest me hot to implement this..
>>
>>
>>
>> Regard
>>
>> Jonty
>>


Re: Sorting result first which come first in sentance

2012-05-03 Thread Li Li
For this version, you may consider using payloads for the position boost:
you can store boost values in the payload.
I have used this with the Lucene API, where anchor text should weigh more
than normal text, but I haven't used it in Solr.
Some links I found:
http://wiki.apache.org/solr/Payloads
http://digitalpebble.blogspot.com/2010/08/using-payloads-with-dismaxqparser-in.html
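
For reference, a minimal Lucene 3.x sketch of the payload idea. The
PositionBoostFilter below is a hypothetical index-time filter (not something
shipped with Lucene) and the field names are placeholders; it stores a
per-position boost as a payload and scores it with a PayloadTermQuery plus a
Similarity that decodes the payload:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.index.Payload;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.payloads.MaxPayloadFunction;
    import org.apache.lucene.search.payloads.PayloadTermQuery;

    // hypothetical index-time filter: earlier positions get a larger payload boost
    final class PositionBoostFilter extends TokenFilter {
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
        private int position = 0;

        PositionBoostFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            float boost = 1.0f / (1.0f + position++);  // position 0 -> 1.0, position 1 -> 0.5, ...
            payloadAtt.setPayload(new Payload(PayloadHelper.encodeFloat(boost)));
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            position = 0;
        }
    }

    // query time: multiply the span score by the decoded payload
    class PositionBoostQueryFactory {
        static Query create(IndexSearcher searcher, String field, String word) {
            searcher.setSimilarity(new DefaultSimilarity() {
                @Override
                public float scorePayload(int docId, String fieldName, int start, int end,
                                          byte[] payload, int offset, int length) {
                    return payload == null ? 1.0f : PayloadHelper.decodeFloat(payload, offset);
                }
            });
            return new PayloadTermQuery(new Term(field, word), new MaxPayloadFunction());
        }
    }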


On Fri, May 4, 2012 at 9:51 AM, Jonty Rhods  wrote:
> I am using solr version 3.4


Re: SOLRJ: Is there a way to obtain a quick count of total results for a query

2012-05-04 Thread Li Li
Not scoring by relevance and sorting by document id instead may speed it
up a little. I haven't tested this, but maybe you can give it a try:
scoring consumes some CPU time, and you only want to match documents and
get the total count.
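
A small SolrJ sketch of the rows=0 count with the suggested non-relevance
sort added; the URL and query are placeholders, and the _docid_ sort is only
an assumption about how to skip relevance sorting, not something benchmarked
here:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("field:value");      // placeholder query
    q.setRows(0);                                    // only the count is needed, no documents
    q.addSortField("_docid_", SolrQuery.ORDER.asc);  // assumption: avoid relevance sorting
    QueryResponse rsp = server.query(q);
    System.out.println("numFound = " + rsp.getResults().getNumFound());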

On Wed, May 2, 2012 at 11:58 PM, vybe3142  wrote:
> I can achieve this by building a query with start and rows = 0, and using
> .getResults().getNumFound().
>
> Are there any more efficient approaches to this?
>
> Thanks
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SOLRJ-Is-there-a-way-to-obtain-a-quick-count-of-total-results-for-a-query-tp3955322.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr query with mandatory values

2012-05-09 Thread Li Li
Putting + before the term is correct; in Lucene a "term" includes both the field and the value.

Query  ::= ( Clause )*

Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) >

<#_ESCAPED_CHAR: "\\" ~[] >


In the Lucene query syntax you can't directly express a term value that
contains a space. You can use quotation marks, but Lucene will treat that
as a phrase query. So you need to escape the space, e.g. title:hello\ world
(in a Java string literal that is "hello\\ world"), which takes
"hello world" as the field value. The analyzer will then tokenize it, so
you should use an analyzer that can deal with spaces, e.g. the keyword
analyzer.

As far as I know, anyway.

On Thu, May 10, 2012 at 3:35 AM, Matt Kuiper  wrote:
> Yes.
>
> See http://wiki.apache.org/solr/SolrQuerySyntax  - The standard Solr Query 
> Parser syntax is a superset of the Lucene Query Parser syntax.
> Which links to http://lucene.apache.org/core/3_6_0/queryparsersyntax.html
>
> Note - Based on the info on these pages I believe the "+" symbol is to be 
> placed just before the mandatory value, not before the field name in the 
> query.
>
> Matt Kuiper
> Intelligent Software Solutions
>
> -Original Message-
> From: G.Long [mailto:jde...@gmail.com]
> Sent: Wednesday, May 09, 2012 10:45 AM
> To: solr-user@lucene.apache.org
> Subject: Solr query with mandatory values
>
> Hi :)
>
> I remember that in a Lucene query, there is something like mandatory values. 
> I just have to add a "+" symbol in front of the mandatory parameter, like: 
> +myField:my value
>
> I was wondering if there was something similar in Solr queries? Or is this 
> behaviour activated by default?
>
> Gary
>
>


Re: Solr query with mandatory values

2012-05-09 Thread Li Li
some sample codes:

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

QueryParser parser = new QueryParser(Version.LUCENE_36, "title",
        new KeywordAnalyzer());

String q = "+title:hello\\ world";  // escaped space -> single term "hello world"

Query query = parser.parse(q);

System.out.println(query);

On Thu, May 10, 2012 at 8:20 AM, Li Li  wrote:
> + before term is correct. in lucene term includes field and value.
>
> Query  ::= ( Clause )*
>
> Clause ::= ["+", "-"] [ ":"] (  | "(" Query ")" )
>
> <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) >
>
> <#_ESCAPED_CHAR: "\\" ~[] >
>
>
> in lucene query syntax, you can't express a term value including space.
> you can use quotation mark but lucene will take it as a phrase query.
> so you need escape space like title:hello\\ world
> which will take "hello world" as a field value. and the analyzer then
> will tokenize it. so you should use analyzer which can deal with
> space. e.g. you can use keyword analyzer
>
> as far as I know
>
> On Thu, May 10, 2012 at 3:35 AM, Matt Kuiper  wrote:
>> Yes.
>>
>> See http://wiki.apache.org/solr/SolrQuerySyntax  - The standard Solr Query 
>> Parser syntax is a superset of the Lucene Query Parser syntax.
>> Which links to http://lucene.apache.org/core/3_6_0/queryparsersyntax.html
>>
>> Note - Based on the info on these pages I believe the "+" symbol is to be 
>> placed just before the mandatory value, not before the field name in the 
>> query.
>>
>> Matt Kuiper
>> Intelligent Software Solutions
>>
>> -Original Message-
>> From: G.Long [mailto:jde...@gmail.com]
>> Sent: Wednesday, May 09, 2012 10:45 AM
>> To: solr-user@lucene.apache.org
>> Subject: Solr query with mandatory values
>>
>> Hi :)
>>
>> I remember that in a Lucene query, there is something like mandatory values. 
>> I just have to add a "+" symbol in front of the mandatory parameter, like: 
>> +myField:my value
>>
>> I was wondering if there was something similar in Solr queries? Or is this 
>> behaviour activated by default?
>>
>> Gary
>>
>>


Re: How can i search site name

2012-05-21 Thread Li Li
You should define your search first.
If the site is www.google.com, how do you want to match it: full string
matching or partial matching? E.g. should "google" match? If it should,
you need to write your own analyzer for this field.
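
As an illustration of the partial-matching option, a Lucene 3.x sketch of an
analyzer that splits a URL on non-letter characters and lowercases it, so
that "google" matches "www.google.com"; the field name and version constant
are placeholders:

    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    Analyzer urlAnalyzer = new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // split on anything that is not a letter, then lowercase:
            // "www.google.com" -> [www, google, com]
            return new LowerCaseFilter(Version.LUCENE_36,
                    new LetterTokenizer(Version.LUCENE_36, reader));
        }
    };

    TokenStream ts = urlAnalyzer.tokenStream("url", new StringReader("www.google.com"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    while (ts.incrementToken()) {
        System.out.println(term.toString());   // prints: www, google, com
    }
    ts.close();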

On Tue, May 22, 2012 at 2:03 PM, Shameema Umer  wrote:
> Sorry,
> Please let me know how can I search site name using the solr query syntax.
> My results should show title, url and content.
> Title and content are being searched even though the
> content.
>
> I need url or site name too. please, help.
>
> Thanks in advance.
>
> On Tue, May 22, 2012 at 11:05 AM, ketan kore  wrote:
>
>> you can go on www.google.com and just type the site which you want to
>> search and google will show you the results as simple as that ...
>>


Re: Installing Solr on Tomcat using Shell - Code wrong?

2012-05-22 Thread Li Li
You should find some clues in the Tomcat log.
On May 22, 2012, at 7:49 PM, "Spadez"  wrote:

> Hi,
>
> This is the install process I used in my shell script to try and get Tomcat
> running with Solr (debian server):
>
>
>
> I swear this used to work, but currently only Tomcat works. The Solr page
> just comes up with "The requested resource (/solr/admin) is not available."
>
> Can anyone give me some insight into why this isnt working? Its driving me
> nuts.
>
> James
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Installing-Solr-on-Tomcat-using-Shell-Code-wrong-tp3985393.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-27 Thread Li Li
Yes, I am also interested in good performance with 2 billion docs. How
many search nodes do you use? What are the average response time and
QPS?

Another question: where can I find a paper or other resources that
explain your algorithm in detail? And why is it better than Google?
(Being better than Lucene is not very interesting, because Lucene was not
originally designed to provide a Google-like search function.)

On Mon, May 28, 2012 at 1:06 AM, Darren Govoni  wrote:
> I think people on this list would be more interested in your approach to
> scaling 2 billion documents than modifying solr/lucene scoring (which is
> already top notch). So given that, can you share any references or
> otherwise substantiate good performance with 2 billion documents?
>
> Thanks.
>
> On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
>> Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
>> docs. With RankingAlgorithm 1.4.3, using the parameters
>> age=latest&docs=number feature, you can retrieve the NRT inserted
>> documents in milliseconds from such a huge index improving query and
>> faceting performance and using very little resources ...
>>
>> Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and
>> the NRT insert performance with Solr 4.0 is about 70,000 docs / sec.
>> RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon.
>>
>> Regards,
>>
>> Nagendra Nagarajayya
>> http://solr-ra.tgels.org
>> http://rankingalgorithm.tgels.org
>>
>>
>>
>> On 5/27/2012 7:32 AM, Darren Govoni wrote:
>> > Hi,
>> >    Have you tested this with a billion documents?
>> >
>> > Darren
>> >
>> > On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:
>> >> Hi!
>> >>
>> >> I am very excited to announce the availability of Solr 3.6 with
>> >> RankingAlgorithm 1.4.2.
>> >>
>> >> This NRT supports now works with both RankingAlgorithm and Lucene. The
>> >> insert/update performance should be about 5000 docs in about 490 ms with
>> >> the MbArtists Index.
>> >>
>> >> RankingAlgorithm 1.4.2 has multiple algorithms, improved performance
>> >> over the earlier releases, supports the entire Lucene Query Syntax, ±
>> >> and/or boolean queries and can scale to more than a billion documents.
>> >>
>> >> You can get more information about NRT performance from here:
>> >> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>> >>
>> >> You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
>> >> http://solr-ra.tgels.org
>> >>
>> >> Please download and give the new version a try.
>> >>
>> >> Regards,
>> >>
>> >> Nagendra Nagarajayya
>> >> http://solr-ra.tgels.org
>> >> http://rankingalgorithm.tgels.org
>> >>
>> >> ps. MbArtists index is the example index used in the Solr 1.4 Enterprise
>> >> Book
>> >>
>> >
>> >
>> >
>>
>
>


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
I have roughly read the code of RAMDirectory. It uses a list of 1024-byte
arrays and has a lot of overhead.
But as far as I know, using MMapDirectory I can't prevent page faults:
the OS will swap the less frequently used pages out. Even if I allocate
enough memory for the JVM, I can't guarantee all the files in the
directory stay in memory. Is my understanding right? If it is, then some
less frequent queries will be slow. How can I keep them always in memory?
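
To make the warm-up idea concrete, a sketch with the Lucene 3.x API (the
index path is made up); this only pre-touches the term dictionary and runs a
query so the relevant pages end up in the OS page cache -- it cannot pin
them there:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.store.MMapDirectory;

    IndexReader reader = IndexReader.open(new MMapDirectory(new File("/path/to/index")), true);
    IndexSearcher searcher = new IndexSearcher(reader);

    // walk the term dictionary once so its pages are faulted into the OS page cache
    TermEnum terms = reader.terms();
    while (terms.next()) { /* reading the terms is the warm-up */ }
    terms.close();

    // run a few representative queries to touch postings and stored fields as well
    searcher.search(new MatchAllDocsQuery(), 10);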

On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskog  wrote:
> Yes, use MMapDirectory. It is faster and uses memory more efficiently
> than RAMDirectory. This sounds wrong, but it is true. With
> RAMDirectory, Java has to work harder doing garbage collection.
>
> On Fri, Jun 8, 2012 at 1:30 AM, Li Li  wrote:
>> hi all
>>   I want to use lucene 3.6 providing searching service. my data is
>> not very large, raw data is less that 1GB and I want to use load all
>> indexes into memory. also I need save all indexes into disk
>> persistently.
>>   I originally want to use RAMDirectory. But when I read its javadoc.
>>
>>   Warning: This class is not intended to work with huge indexes.
>> Everything beyond several hundred megabytes
>>  will waste resources (GC cycles), because it uses an internal buffer
>> size of 1024 bytes, producing millions of byte
>>  [1024] arrays. This class is optimized for small memory-resident
>> indexes. It also has bad concurrency on
>>  multithreaded environments.
>> It is recommended to materialize large indexes on disk and use
>> MMapDirectory, which is a high-performance
>>  directory implementation working directly on the file system cache of
>> the operating system, so copying data to
>>  Java heap space is not useful.
>>
>>    should I use MMapDirectory? it seems another contrib instantiated.
>> anyone test it with RAMDirectory?
>
>
>
> --
> Lance Norskog
> goks...@gmail.com


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
1. This setting is global; I just don't want my Lucene search program to
swap. Other, less important programs can still swap.
2. Do I need to call MappedByteBuffer.load() explicitly, or do I have to
warm up the indexes to make sure all my files are in physical memory?

On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmann  wrote:
> Set the swapiness to 0 to avoid memory pages being swapped to disk too
> early.
>
> http://en.wikipedia.org/wiki/Swappiness
>
> -Kuli
>
> Am 11.06.2012 10:38, schrieb Li Li:
>
>> I have roughly read the codes of RAMDirectory. it use a list of 1024
>> byte arrays and many overheads.
>> But as far as I know, using MMapDirectory, I can't prevent the page
>> faults. OS will swap less frequent pages out. Even if I allocate
>> enough memory for JVM, I can guarantee all the files in the directory
>> are in memory. am I understanding right? if it is, then some less
>> frequent queries will be slow.  How can I let them always in memory?
>>
>> On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskog  wrote:
>>>
>>> Yes, use MMapDirectory. It is faster and uses memory more efficiently
>>> than RAMDirectory. This sounds wrong, but it is true. With
>>> RAMDirectory, Java has to work harder doing garbage collection.
>>>
>>> On Fri, Jun 8, 2012 at 1:30 AM, Li Li  wrote:
>>>>
>>>> hi all
>>>>   I want to use lucene 3.6 providing searching service. my data is
>>>> not very large, raw data is less that 1GB and I want to use load all
>>>> indexes into memory. also I need save all indexes into disk
>>>> persistently.
>>>>   I originally want to use RAMDirectory. But when I read its javadoc.
>>>>
>>>>   Warning: This class is not intended to work with huge indexes.
>>>> Everything beyond several hundred megabytes
>>>>  will waste resources (GC cycles), because it uses an internal buffer
>>>> size of 1024 bytes, producing millions of byte
>>>>  [1024] arrays. This class is optimized for small memory-resident
>>>> indexes. It also has bad concurrency on
>>>>  multithreaded environments.
>>>> It is recommended to materialize large indexes on disk and use
>>>> MMapDirectory, which is a high-performance
>>>>  directory implementation working directly on the file system cache of
>>>> the operating system, so copying data to
>>>>  Java heap space is not useful.
>>>>
>>>>    should I use MMapDirectory? it seems another contrib instantiated.
>>>> anyone test it with RAMDirectory?
>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>
>


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
Do you mean a software RAM disk, i.e. using RAM to simulate a disk? How
would I deal with persistence then?

Maybe I can hack it by increasing RAMOutputStream.BUFFER_SIZE from 1024 to
1024*1024. That may waste some memory, but I can adjust my merge policy to
avoid too many segments: I will have a "big" segment and a "small"
segment, and every night I will merge them. Newly added documents will be
flushed into a new segment, and I will merge that new segment with the
small one. Our update operations are not very frequent.

On Mon, Jun 11, 2012 at 4:59 PM, Paul Libbrecht  wrote:
> Li Li,
>
> have you considered allocating a RAM-Disk?
> It's not the most flexible thing... but it's certainly close, in performance 
> to a RAMDirectory.
> MMapping on that is likely to be useless but I doubt you can set it to zero.
> That'd need experiment.
>
> Also, doesn't caching and auto-warming provide the lowest latency for all 
> "expected queries" ?
>
> Paul
>
>
> Le 11 juin 2012 à 10:50, Li Li a écrit :
>
>>   I want to use lucene 3.6 providing searching service. my data is
>> not very large, raw data is less that 1GB and I want to use load all
>> indexes into memory. also I need save all indexes into disk
>> persistently.
>>   I originally want to use RAMDirectory. But when I read its javadoc.
>
>


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
Sorry, I made a mistake: even with RAMDirectory I cannot guarantee the
pages are not swapped out.

On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmann  wrote:
> Set the swapiness to 0 to avoid memory pages being swapped to disk too
> early.
>
> http://en.wikipedia.org/wiki/Swappiness
>
> -Kuli
>
> Am 11.06.2012 10:38, schrieb Li Li:
>
>> I have roughly read the codes of RAMDirectory. it use a list of 1024
>> byte arrays and many overheads.
>> But as far as I know, using MMapDirectory, I can't prevent the page
>> faults. OS will swap less frequent pages out. Even if I allocate
>> enough memory for JVM, I can guarantee all the files in the directory
>> are in memory. am I understanding right? if it is, then some less
>> frequent queries will be slow.  How can I let them always in memory?
>>
>> On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskog  wrote:
>>>
>>> Yes, use MMapDirectory. It is faster and uses memory more efficiently
>>> than RAMDirectory. This sounds wrong, but it is true. With
>>> RAMDirectory, Java has to work harder doing garbage collection.
>>>
>>> On Fri, Jun 8, 2012 at 1:30 AM, Li Li  wrote:
>>>>
>>>> hi all
>>>>   I want to use lucene 3.6 providing searching service. my data is
>>>> not very large, raw data is less that 1GB and I want to use load all
>>>> indexes into memory. also I need save all indexes into disk
>>>> persistently.
>>>>   I originally want to use RAMDirectory. But when I read its javadoc.
>>>>
>>>>   Warning: This class is not intended to work with huge indexes.
>>>> Everything beyond several hundred megabytes
>>>>  will waste resources (GC cycles), because it uses an internal buffer
>>>> size of 1024 bytes, producing millions of byte
>>>>  [1024] arrays. This class is optimized for small memory-resident
>>>> indexes. It also has bad concurrency on
>>>>  multithreaded environments.
>>>> It is recommended to materialize large indexes on disk and use
>>>> MMapDirectory, which is a high-performance
>>>>  directory implementation working directly on the file system cache of
>>>> the operating system, so copying data to
>>>>  Java heap space is not useful.
>>>>
>>>>    should I use MMapDirectory? it seems another contrib instantiated.
>>>> anyone test it with RAMDirectory?
>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>
>


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
Yes, I need an average query time of less than 10 ms, and the faster the
better. I have enough memory for Lucene because I know there is not too
much data, and there are not many modifications: every day there are only
a few hundred document updates. If the indexes are not in physical
memory, the IO operations will cost a few extra milliseconds.
By the way, full GC may also add uncertainty, so I need to optimize that
as much as possible.
On Mon, Jun 11, 2012 at 5:27 PM, Michael Kuhlmann  wrote:
> You cannot guarantee this when you're running out of RAM. You'd have a
> problem then anyway.
>
> Why are you caring that much? Did you yet have performance issues? 1GB
> should load really fast, and both auto warming and OS cache should help a
> lot as well. With such an index, you usually don't need to fine tune
> performance that much.
>
> Did you think about using a SSD? Since you want to persist your index,
> you'll need to live with disk IO anyway.
>
> Greetings,
> Kuli
>
> Am 11.06.2012 11:20, schrieb Li Li:
>
>> I am sorry. I make a mistake. even use RAMDirectory, I can not
>> guarantee they are not swapped out.
>>
>> On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmann
>>  wrote:
>>>
>>> Set the swapiness to 0 to avoid memory pages being swapped to disk too
>>> early.
>>>
>>> http://en.wikipedia.org/wiki/Swappiness
>>>
>>> -Kuli
>>>
>>> Am 11.06.2012 10:38, schrieb Li Li:
>>>
>>>> I have roughly read the codes of RAMDirectory. it use a list of 1024
>>>> byte arrays and many overheads.
>>>> But as far as I know, using MMapDirectory, I can't prevent the page
>>>> faults. OS will swap less frequent pages out. Even if I allocate
>>>> enough memory for JVM, I can guarantee all the files in the directory
>>>> are in memory. am I understanding right? if it is, then some less
>>>> frequent queries will be slow.  How can I let them always in memory?
>>>>
>>>> On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskog
>>>>  wrote:
>>>>>
>>>>>
>>>>> Yes, use MMapDirectory. It is faster and uses memory more efficiently
>>>>> than RAMDirectory. This sounds wrong, but it is true. With
>>>>> RAMDirectory, Java has to work harder doing garbage collection.
>>>>>
>>>>> On Fri, Jun 8, 2012 at 1:30 AM, Li Li    wrote:
>>>>>>
>>>>>>
>>>>>> hi all
>>>>>>   I want to use lucene 3.6 providing searching service. my data is
>>>>>> not very large, raw data is less that 1GB and I want to use load all
>>>>>> indexes into memory. also I need save all indexes into disk
>>>>>> persistently.
>>>>>>   I originally want to use RAMDirectory. But when I read its javadoc.
>>>>>>
>>>>>>   Warning: This class is not intended to work with huge indexes.
>>>>>> Everything beyond several hundred megabytes
>>>>>>  will waste resources (GC cycles), because it uses an internal buffer
>>>>>> size of 1024 bytes, producing millions of byte
>>>>>>  [1024] arrays. This class is optimized for small memory-resident
>>>>>> indexes. It also has bad concurrency on
>>>>>>  multithreaded environments.
>>>>>> It is recommended to materialize large indexes on disk and use
>>>>>> MMapDirectory, which is a high-performance
>>>>>>  directory implementation working directly on the file system cache of
>>>>>> the operating system, so copying data to
>>>>>>  Java heap space is not useful.
>>>>>>
>>>>>>    should I use MMapDirectory? it seems another contrib instantiated.
>>>>>> anyone test it with RAMDirectory?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goks...@gmail.com
>>>
>>>
>>>
>


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
I found this:
http://unix.stackexchange.com/questions/10214/per-process-swapiness-for-linux
It can provide fine-grained control of swapping.

On Mon, Jun 11, 2012 at 4:45 PM, Michael Kuhlmann  wrote:
> Set the swapiness to 0 to avoid memory pages being swapped to disk too
> early.
>
> http://en.wikipedia.org/wiki/Swappiness
>
> -Kuli
>
> Am 11.06.2012 10:38, schrieb Li Li:
>
>> I have roughly read the codes of RAMDirectory. it use a list of 1024
>> byte arrays and many overheads.
>> But as far as I know, using MMapDirectory, I can't prevent the page
>> faults. OS will swap less frequent pages out. Even if I allocate
>> enough memory for JVM, I can guarantee all the files in the directory
>> are in memory. am I understanding right? if it is, then some less
>> frequent queries will be slow.  How can I let them always in memory?
>>
>> On Fri, Jun 8, 2012 at 5:53 PM, Lance Norskog  wrote:
>>>
>>> Yes, use MMapDirectory. It is faster and uses memory more efficiently
>>> than RAMDirectory. This sounds wrong, but it is true. With
>>> RAMDirectory, Java has to work harder doing garbage collection.
>>>
>>> On Fri, Jun 8, 2012 at 1:30 AM, Li Li  wrote:
>>>>
>>>> hi all
>>>>   I want to use lucene 3.6 providing searching service. my data is
>>>> not very large, raw data is less that 1GB and I want to use load all
>>>> indexes into memory. also I need save all indexes into disk
>>>> persistently.
>>>>   I originally want to use RAMDirectory. But when I read its javadoc.
>>>>
>>>>   Warning: This class is not intended to work with huge indexes.
>>>> Everything beyond several hundred megabytes
>>>>  will waste resources (GC cycles), because it uses an internal buffer
>>>> size of 1024 bytes, producing millions of byte
>>>>  [1024] arrays. This class is optimized for small memory-resident
>>>> indexes. It also has bad concurrency on
>>>>  multithreaded environments.
>>>> It is recommended to materialize large indexes on disk and use
>>>> MMapDirectory, which is a high-performance
>>>>  directory implementation working directly on the file system cache of
>>>> the operating system, so copying data to
>>>>  Java heap space is not useful.
>>>>
>>>>    should I use MMapDirectory? it seems another contrib instantiated.
>>>> anyone test it with RAMDirectory?
>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>
>


Re: what's better for in memory searching?

2012-06-11 Thread Li Li
Is this method equivalent to setting vm.swappiness, which is global?
Or can it set the swappiness just for the JVM process?

On Tue, Jun 12, 2012 at 5:11 AM, Mikhail Khludnev
 wrote:
> Point about premature optimization makes sense for me. However some time
> ago I've bookmarked potentially useful approach
> http://lucene.472066.n3.nabble.com/High-response-time-after-being-idle-tp3616599p3617604.html.
>
> On Mon, Jun 11, 2012 at 3:02 PM, Toke Eskildsen 
> wrote:
>
>> On Mon, 2012-06-11 at 11:38 +0200, Li Li wrote:
>> > yes, I need average query time less than 10 ms. The faster the better.
>> > I have enough memory for lucene because I know there are not too much
>> > data. there are not many modifications. every day there are about
>> > hundreds of document update. if indexes are not in physical memory,
>> > then IO operations will cost a few ms.
>>
>> I'm with Michael on this one: It seems that you're doing a premature
>> optimization. Guessing that your final index will be < 5GB in size with
>> 1 million documents (give or take 900.000:-), relatively simple queries
>> and so on, an average response time of 10 ms should be attainable even
>> on spinning drives. One hundred document updates per day are not many,
>> so again I would not expect problems.
>>
>> As is often the case on this mailing list, the advice is "try it". Using
>> a normal on-disk index and doing some warm up is the easy solution to
>> implement and nearly all of your work on this will be usable for a
>> RAM-based solution, if you are not satisfied with the speed. Or you
>> could buy a small & cheap SSD and have no more worries...
>>
>> Regards,
>> Toke Eskildsen
>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  


strange problem

2010-11-15 Thread Li Li
hi all
   I ran into a strange problem when feeding data to Solr. I started
feeding and then hit Ctrl+C to kill the feed program (post.jar). Because
the XML stream is terminated abnormally, DirectUpdateHandler2 throws an
exception. I went to the index directory and sorted it by date; the
newest files are fdt and fdx. That makes sense, because index files such
as tii and tis are still in memory at that point. Then I shut down Tomcat
with ./bin/shutdown.sh. When I restarted Tomcat, the index was corrupted:
when I searched, the docIDs didn't exist. An exception is thrown in
TermScorer at
  return norms == null ? raw : raw * SIM_NORM_DECODER[norms[doc] & 0xFF]
because doc is out of range of norms[]. It is also thrown in
SegmentTermDocs:
  public int read(final int[] docs, final int[] freqs)
  if (deletedDocs == null || !deletedDocs.get(doc))   // get(doc) is also
out of range of deletedDocs.
  It seems the index files are not correct, so when a docId is decoded it
is not a valid one.

  Another observation: if after Ctrl+C-ing post.jar I feed some other
data so that a commit succeeds, everything is fine.
  It seems the index buffered in memory is flushed to disk when Tomcat is
shut down, and the flushed files are not correct.
  Has anyone run into this problem before?


Re: shutdown.sh does not kill the tomcat process running solr./?

2010-11-30 Thread Li Li
1. Make sure the port is not already in use.
2. Run ./bin/shutdown.sh && tail -f logs/xxx to see what the server is doing.
If you have just fed data or modified the index and haven't flushed or
committed, the server will still have work to do while shutting down.

2010/12/1 Robert Petersen :
> Greetings, we're wondering why we can issue the command to shutdown
> tomcat/solr but the process remains visible in memory (by using the top
> command) and we have to manually kill the PID for it to release its
> memory before we can (re)start tomcat/solr?  Anybody have any ideas?
> The process is using 12+ GB main memory typically but can go up to 40 GB
> on the master where we index.  We have 64GB main memory on these
> servers.  I set the heap at 12 GB and use the concurrent garbage
> collector too.
>
>
>
> That raises another question:  top can show only 20 GB free out of 64
> but the tomcat/solr process only shows its using half of that.  What is
> using the rest?  The numbers don't add up...
>
>
>
> Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat
>
> Platform: RHEL with Sun JRE 1.6.0_18
>
>


Re: Best practice for Delta every 2 Minutes.

2010-11-30 Thread Li Li
You may implement your own MergePolicy to keep one large segment and
merge all the other small ones, or simply set the merge factor to 2 and
keep the largest segment out of merges by setting maxMergeDocs to less
than the number of docs in that segment.
That way there is one large segment and one small one: when you add a few
docs, they are merged into the small one, and you can, e.g., optimize the
index weekly to merge everything back into one segment.
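
In Solr these correspond to the <mergeFactor> and <maxMergeDocs> settings
in solrconfig.xml; at the Lucene level (2.9-era API, with a placeholder path
and threshold) the equivalent is roughly:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),        // placeholder path
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

    // keep one big segment and one small one: new flushes get merged into the
    // small segment, and maxMergeDocs keeps the big segment out of merges
    writer.setMergeFactor(2);
    writer.setMaxMergeDocs(1000000);   // anything below the doc count of the big segment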

2010/11/30 stockii :
>
> Hello.
>
> index is about 28 Million documents large. When i starts an delta-import is
> look at modified. but delta import takes to long. over an hour need solr for
> delta.
>
> thats my query. all sessions from the last hour should updated and all
> changed. i think its normal that solr need long time for the querys. how can
> i optimize this ?
>
> deltaQuery="SELECT id FROM sessions
> WHERE created BETWEEN DATE_ADD( NOW(), INTERVAL - 10 HOUR ) AND NOW()
> OR modified BETWEEN '${dataimporter.last_index_time}' AND DATE_ADD( NOW(),
> INTERVAL - 1 HOUR  ) "
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992714.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Best practice for Delta every 2 Minutes.

2010-12-16 Thread Li Li
I think it will not, because the default configuration only allows 2
warming searchers at a time, but the delay will get longer and longer:
each newer searcher has to wait for the two earlier ones to finish.

2010/12/1 Jonathan Rochkind :
> If your index warmings take longer than two minutes, but you're doing a
> commit every two minutes -- you're going to run into trouble with
> overlapping index preperations, eventually leading to an OOM.  Could this be
> it?
>
> On 11/30/2010 11:36 AM, Erick Erickson wrote:
>>
>> I don't know, you'll have to debug it to see if it's the thing that takes
>> so
>> long. Solr
>> should be able to handle 1,200 updates in a very short time unless there's
>> something
>> else going on, like you're committing after every update or something.
>>
>> This may help you track down performance with DIH
>>
>> http://wiki.apache.org/solr/DataImportHandler#interactive
>>
>> Best
>> Erick
>>
>> On Tue, Nov 30, 2010 at 9:01 AM, stockii  wrote:
>>
>>> how do you think is the deltaQuery better ? XD
>>> --
>>> View this message in context:
>>>
>>> http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992774.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>


Re: Best practice for Delta every 2 Minutes.

2010-12-16 Thread Li Li
We are now facing the same situation and want to implement it like this:
we add new documents to a RAMDirectory and search two indexes -- the
on-disk index and the RAM index.
Regularly (e.g. every hour) we flush the RAMDirectory to disk as a new
segment.
To guard against failures, before adding to the RAMDirectory we write the
document to a log file, and after flushing we delete the corresponding
lines from the log file.
If the program crashes, we redo the log and add those documents back into
the RAMDirectory.
Has anyone done similar work?
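
A bare-bones sketch of the "search both indexes" part with the Lucene API
(paths are placeholders, and the redo-log handling is left out):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    FSDirectory diskDir = FSDirectory.open(new File("/path/to/index"));  // placeholder path
    RAMDirectory ramDir = new RAMDirectory();

    // writer for the in-memory index that receives the fresh documents
    IndexWriter ramWriter = new IndexWriter(ramDir,
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);
    ramWriter.commit();   // make the (still empty) RAM index readable

    // one searcher over both the on-disk index and the RAM index
    IndexReader diskReader = IndexReader.open(diskDir, true);
    IndexReader ramReader = IndexReader.open(ramDir, true);
    IndexSearcher searcher = new IndexSearcher(
            new MultiReader(new IndexReader[] { diskReader, ramReader }));

    // ... serve queries; periodically flush the RAM index into the disk index,
    //     truncate the redo log and reopen the readers ...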

2010/12/1 Li Li :
> you may implement your own MergePolicy to keep on large index and
> merge all other small ones
> or simply set merge factor to 2 and the largest index not be merged by
> set maxMergeDocs less than the docs in the largest one.
> So there is one large index and a small one. when adding a little
> docs, they will be merged into the small one. and you can, e.g. weekly
> optimize the index and merge all indice into one index.
>
> 2010/11/30 stockii :
>>
>> Hello.
>>
>> index is about 28 Million documents large. When i starts an delta-import is
>> look at modified. but delta import takes to long. over an hour need solr for
>> delta.
>>
>> thats my query. all sessions from the last hour should updated and all
>> changed. i think its normal that solr need long time for the querys. how can
>> i optimize this ?
>>
>> deltaQuery="SELECT id FROM sessions
>> WHERE created BETWEEN DATE_ADD( NOW(), INTERVAL - 10 HOUR ) AND NOW()
>> OR modified BETWEEN '${dataimporter.last_index_time}' AND DATE_ADD( NOW(),
>> INTERVAL - 1 HOUR  ) "
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992714.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>


Re: Optimizing to only 1 segment

2010-12-26 Thread Li Li
See maxMergeDocs (maxMergeSize) in solrconfig.xml: if a segment contains
more documents than this value, it will not be merged.

2010/12/27 Rok Rejc :
> Hi all,
>
> I have created an index, commited the data and after that I had run the
> optimize with default parameters:
>
> http://localhost:8080/myindex/update?stream.body=
>
> I was suprised that after the optimizing was finished there was 21 segments
> in the index:
>
> reader : 
> SolrIndexReader{this=724a2dd4,r=directoryrea...@724a2dd4,refCnt=1,segments=21}
>
>
> If I understand the documentation correctly the default maxSegments is 1.
> Nothing helps even if I put the parameter in the command:
>
> http://localhost:8080/myindex/update?maxSegments=1&stream.body=
>
> Am I missing something?
>
> I am using the trunk version with SimpleFSDirectory
>
> Thanks, Rok
>


Re: Optimizing to only 1 segment

2010-12-26 Thread Li Li
Maybe you can check the log files; they may show you something.
Btw, how do you send the command?
Do you use curl 'http://localhost:8983/solr/update?optimize=true',
or do you post an XML file?

2010/12/27 Rok Rejc :
> On Mon, Dec 27, 2010 at 3:26 AM, Li Li  wrote:
>
>> see maxMergeDocs(maxMergeSize) in solrconfig.xml. if the segment's
>> documents size is larger than this value, it will not be merged.
>>
>
> I see that in my solrconfig.xml, but it is commented and marked as
> deprecated. I have uncommented this setting (so the value was 2147483647) un
> run the optmize again but it finished immediately and still left 21
> segments.
>
> Btw mergeFactor is set to 20, maxDoc is 121490241, the index will be
> "read-only".
>
> Thanks.
>


Re: Optimizing to only 1 segment

2010-12-27 Thread Li Li
Oh, you mean the Lucene 4 trunk.
 LogByteSizeMergePolicy's default max merge size is 2048 MB (2 GB).
 I did it like this:
 LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
 mp.setMaxMergeMB(100);
 config.setMergePolicy(mp); // attach it to the IndexWriterConfig before opening the writer


2010/12/27 Rok Rejc :
> Okej the same thing happens if i run optimize in java:
>
>        File file = new File("e:\\myIndex\\index");
>        Directory directory = FSDirectory.open(file);
>
>        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
> analyzer);
>        IndexWriter writer = new IndexWriter(directory, config);
>        writer.optimize(1, true);
>        writer.close();
>        System.out.println("Finished");
>
> The optmize finishes immediately but there are still 21 segments after the
> optimize?
>
> Any clue what else should I set?
>
>
> On Mon, Dec 27, 2010 at 8:08 AM, Rok Rejc  wrote:
>
>> Hi, there is nothing in the log, and the optimize finishes successfully:
>>
>> 
>> 
>> 0
>> 17
>> 
>> 
>>
>> I run optmize through browser by entering url
>>
>> http://localhost:8080/myindex/update?optimize=true
>> or
>> http://localhost:8080/myindex/update?stream.body=
>>
>> Thanks.
>>
>>
>> On Mon, Dec 27, 2010 at 7:12 AM, Li Li  wrote:
>>
>>> maybe you can consult log files and it may show you something
>>> btw how do you post your command?
>>> do you use curl 'http://localhost:8983/solr/update?optimize=true' ?
>>> or posting a xml file?
>>>
>>> 2010/12/27 Rok Rejc :
>>> > On Mon, Dec 27, 2010 at 3:26 AM, Li Li  wrote:
>>> >
>>> >> see maxMergeDocs(maxMergeSize) in solrconfig.xml. if the segment's
>>> >> documents size is larger than this value, it will not be merged.
>>> >>
>>> >
>>> > I see that in my solrconfig.xml, but it is commented and marked as
>>> > deprecated. I have uncommented this setting (so the value was
>>> 2147483647) un
>>> > run the optmize again but it finished immediately and still left 21
>>> > segments.
>>> >
>>> > Btw mergeFactor is set to 20, maxDoc is 121490241, the index will be
>>> > "read-only".
>>> >
>>> > Thanks.
>>> >
>>>
>>
>>
>


Re: Turn off caching

2011-02-10 Thread Li Li
Do you mean the queryResultCache? You can comment out the related section
in solrconfig.xml.
See http://wiki.apache.org/solr/SolrCaching

2011/2/8 Isan Fulia :
> Hi,
> My solrConfig file looks like
>
> 
>  
>
>  
>     multipartUploadLimitInKB="2048" />
>  
>
>   default="true" />
>  
>   class="org.apache.solr.handler.admin.AdminHandlers" />
>
>
>   class="org.apache.solr.request.XSLTResponseWriter">
>  
>  
>  
>    *:*
>  
> 
>
>
> EveryTime I fire the same query so as to compare the results for different
> configurations , the query result time is getting reduced because of
> caching.
> So I want to turn off the cahing or clear the ache before  i fire the same
> query .
> Does anyone know how to do it.
>
>
> --
> Thanks & Regards,
> Isan Fulia.
>


Re: Turn off caching

2011-02-11 Thread Li Li
Besides the FieldCache, there is also a cache for TermInfo.
I don't know how to turn it off in either Lucene or Solr.

codes in TermInfosReader
  /** Returns the TermInfo for a Term in the set, or null. */
  TermInfo get(Term term) throws IOException {
return get(term, true);
  }

  /** Returns the TermInfo for a Term in the set, or null. */
  private TermInfo get(Term term, boolean useCache) throws IOException

2011/2/11 Stijn Vanhoorelbeke :
> Hi,
>
> You can comment out all sections in solrconfig.xml pointing to a cache.
> However, there is a cache deep in Lucence - the fieldcache - that can't be
> commented out. This cache will always jump into the picture
>
> If I need to do such things, I restart the whole tomcat6 server to flush ALL
> caches.
>
> 2011/2/11 Li Li 
>
>> do you mean queryResultCache? you can comment related paragraph in
>> solrconfig.xml
>> see http://wiki.apache.org/solr/SolrCaching
>>
>> 2011/2/8 Isan Fulia :
>> > Hi,
>> > My solrConfig file looks like
>> >
>> > 
>> >  
>> >
>> >  
>> >    > > multipartUploadLimitInKB="2048" />
>> >  
>> >
>> >  > > default="true" />
>> >  
>> >  > > class="org.apache.solr.handler.admin.AdminHandlers" />
>> >
>> >
>> >  > > class="org.apache.solr.request.XSLTResponseWriter">
>> >  
>> >  
>> >  
>> >    *:*
>> >  
>> > 
>> >
>> >
>> > EveryTime I fire the same query so as to compare the results for
>> different
>> > configurations , the query result time is getting reduced because of
>> > caching.
>> > So I want to turn off the cahing or clear the ache before  i fire the
>> same
>> > query .
>> > Does anyone know how to do it.
>> >
>> >
>> > --
>> > Thanks & Regards,
>> > Isan Fulia.
>> >
>>
>


I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
hi
It seems my mail was judged as spam.
Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient
domain. We recommend contacting the other email provider for further
information about the cause of this error. The error that the other server
returned was: 552 552 spam score (5.1) exceeded threshold
(FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
(state 18).


Re: I send a email to lucene-dev solr-dev lucene-user but always failed

2011-03-11 Thread Li Li
Problem of Replication Reservation Duration

hi all,
I tried to send this mail to the solr-dev mailing list but it was flagged
as spam, so I am sending it again, and to lucene-dev too.
The replication handler in Solr 1.4, which we use, seems to be a little
problematic in some extreme situations.
The default reserve duration is 10 s and can't be modified by any method:
  private Integer reserveCommitDuration =
SnapPuller.readInterval("00:00:10");
The current implementation is: the slave sends an HTTP request
(CMD_GET_FILE_LIST) to ask the master to list the current index files.
In the master's response code, the commit is reserved for 10 s:
  // reserve the indexcommit for sometime
  core.getDeletionPolicy().setReserveDuration(version,
reserveCommitDuration);
   If the master's index changes within those 10 s, the old version is not
yet deleted; after that, the old version can be deleted.
The slave then fetches the files in the list one by one.
Consider the following situation.
Every midnight we optimize the whole index into one single segment, and
every 15 minutes we add new segments to it.
When the slave copies the large optimized index, it takes more than 15
minutes, so it fails to copy all the files and retries 5 minutes later.
But each time it re-copies all the files into a new tmp directory, so it
fails again and again as long as we keep updating the index every 15
minutes.
We can tackle this problem by setting reserveCommitDuration to 20
minutes, but because we update small numbers of documents very
frequently, many useless index versions would be reserved, which is a
waste of disk space.
Has anyone run into this problem before, and is there a solution for it?
We came up with an ugly solution like this: the slave fetches files using
multiple threads, one thread per file. That way the master opens all the
files the slave needs; as long as a file is open, the master can "delete"
it but the inode reference count stays larger than 0. Because having the
master read too many files at once would hurt its performance, we want to
use some synchronization mechanism so that only 1 or 2 ReplicationHandler
threads execute the CMD_GET_FILE command at a time.
Is that solution feasible?

2011/3/11 Li Li 

> hi
> it seems my mail is judged as spam.
> Technical details of permanent failure:
> Google tried to deliver your message, but it was rejected by the recipient
> domain. We recommend contacting the other email provider for further
> information about the cause of this error. The error that the other server
> returned was: 552 552 spam score (5.1) exceeded threshold
> (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
> (state 18).


Re: What request handlers to use for query strings in Chinese or Japanese?

2011-03-17 Thread Li Li
That's the job of the analyzer you configure for those fields.
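
For example (a sketch only, with a placeholder field name and version
constant): the Lucene 3.x contrib analyzers handle CJK text either by
bigramming (CJKAnalyzer) or by dictionary-based segmentation
(SmartChineseAnalyzer). In Solr you would put the corresponding analyzer or
tokenizer factory on the fieldType in schema.xml, and the standard or dismax
handler then works with whatever tokens that field's analyzer produces.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    // CJKAnalyzer indexes CJK text as overlapping character bigrams, so a query
    // string without any whitespace still produces usable tokens.
    Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("title", new StringReader("全文検索エンジン"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    while (ts.incrementToken()) {
        System.out.println(term.toString());
    }
    ts.close();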


2011/3/17 Andy :
> Hi,
>
> For my Solr server, some of the query strings will be in Asian languages such 
> as Chinese or Japanese.
>
> For such query strings, would the Standard or Dismax request handler work? My 
> understanding is that both the Standard and the Dismax handler tokenize the 
> query string by whitespace. And that wouldn't work for Chinese or Japanese, 
> right?
>
> In that case, what request handler should I use? And if I need to set up 
> custom request handlers for those languages, how do I do it?
>
> Thanks.
>
> Andy
>
>
>
>


Re: Helpful new JVM parameters

2011-03-17 Thread Li Li
Will UseCompressedOops be useful? For an application using less than 4 GB
of memory it will be better than full 64-bit references, but for an
application using more memory it will not be as cache friendly.
"JRockit: The Definitive Guide" says: "Naturally, 64 GB isn't a
theoretical limit but just an example. It was mentioned because
compressed references on 64-GB heaps have proven beneficial compared
to full 64-bit pointers in some benchmarks and applications. What
really matters, is how many bits can be spared and the performance
benefit of this approach. In some cases, it might just be easier to
use full length 64-bit pointers."

2011/3/18 Dyer, James :
> We're on the final stretch in getting our product database in Production with 
> Solr.  We have 13m "wide-ish" records with quite a few stored fields in a 
> single index (no shards).  We sort on at least a dozen fields and facet on 
> 20-30.  One thing that came up in QA testing is we were getting full gc's due 
> to "promotion failed" conditions.  This led us to believe we were dealing 
> with large objects being created and a fragmented old generation.  After 
> improving, but not solving, the problem by tweaking "conventional" jvm 
> parameters, our JVM expert learned about some newer tuning params included in 
> Sun/Oracle's JDK 1.6.0_24 (we're running RHEL x64, but I think these are 
> available on other platforms too):
>
> These 3 options dramatically reduced the # objects getting promoted into the 
> Old Gen, reducing fragmentation and CMS frequency & time:
> -XX:+UseStringCache
> -XX:+OptimizeStringConcat
> -XX:+UseCompressedStrings
>
> This uses compressed pointers on a 64-bit JVM, significantly reducing the 
> memory & performance penalty in using a 64-bit jvm over 32-bit.  This reduced 
> our new GC (ParNew) time significantly:
> -XX:+UseCompressedOops
>
> The default for this was causing CMS to begin too late sometimes.  (the 
> documentated 68% proved false in our case.  We figured it was defaulting 
> close to 90%)  Much lower than 75%, though, and CMS ran far too often:
> -XX:CMSInitiatingOccupancyFraction=75
>
> This made the "stop-the-world" pauses during CMS much shorter:
> -XX:+CMSParallelRemarkEnabled
>
> We use these in conjunction with CMS/ParNew and a 22gb heap (64gb total on 
> the box), with a 1.2G newSize/maxNewSize.
>
> In case anyone else is having similar issues, we thought we would share our 
> experience with these newer options.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>


Re: Snappull failed

2011-03-19 Thread Li Li
Has the master updated its index during replication?
This can occur when the slave fails to download a file because of a network
problem. 209715200!=583644834 means the file the slave was fetching is
583644834 bytes, but it only downloaded 209715200 bytes; maybe the
connection timed out.

2011/2/16 Markus Jelsma :
> Hi,
>
> There are a couple of Solr 1.4.1 slaves, all doing the same. Pulling some
> snaps, handling some queries, nothing exciting. But can anyone explain a
> sudden nightly occurence of this error?
>
> 2011-02-16 01:23:04,527 ERROR [solr.handler.ReplicationHandler] - [pool-238-
> thread-1] - : SnapPull failed
> org.apache.solr.common.SolrException: Unable to download _gv.frq completely.
> Downloaded 209715200!=583644834
>        at
> org.apache.solr.handler.SnapPuller$FileFetcher.cleanup(SnapPuller.java:1026)
>        at
> org.apache.solr.handler.SnapPuller$FileFetcher.fetchFile(SnapPuller.java:906)
>        at
> org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:541)
>        at
> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:294)
>        at
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
>        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>        at
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
>        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
>        at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
>        at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
>        at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:619)
>
> All i know is that it was unable to download but the reason eludes me.
> Sometimes, a machine rolls out many of these errors and increasing the index
> size because it can't handle the already downloaded data.
>
> Cheers,
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>


Re: RamBufferSize and AutoCommit

2011-03-28 Thread Li Li
There are 3 conditions that will trigger an auto flush in Lucene:
1. the estimated size of the index in RAM is larger than the RAM buffer size;
2. the number of documents buffered in memory is larger than the number set
by setMaxBufferedDocs;
3. the number of buffered delete terms is larger than the number set by
setMaxBufferedDeleteTerms.

Auto flushing by time interval is added by Solr.

ramBufferSizeMB works on an estimated size, and the real memory used may be
larger than this value. So if your Xmx is 2700M, ramBufferSizeMB should be
set well below it. If you set ramBufferSizeMB to 2700M and the other
conditions are not triggered, I think it will hit an OOM exception.
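
At the Lucene level these thresholds map to the IndexWriter setters below
(2.9-era API; the path and values are placeholders). In Solr,
<ramBufferSizeMB> and <maxBufferedDocs> in solrconfig.xml feed the first two:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),        // placeholder path
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

    writer.setRAMBufferSizeMB(32.0);        // 1. flush when estimated RAM usage exceeds 32 MB
    writer.setMaxBufferedDocs(10000);       // 2. flush when this many docs are buffered
    writer.setMaxBufferedDeleteTerms(1000); // 3. flush when this many delete terms are buffered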

2011/3/28 Isan Fulia :
> Hi all ,
>
> I would like to know is there any relation between autocommit and
> rambuffersize.
> My solr config does not  contain rambuffersize which mean its
> deault(32mb).Autocommit setting are after 500 docs or 80 sec
> whichever is first.
> Solr starts with Xmx 2700M .Total Ram is 4 GB.
> Does the rambufferSize is alloted outside the heap memory(2700M)?
> How does rambuffersize is related to out of memory errors.
> What is the optimal value for rambuffersize.
>
> --
> Thanks & Regards,
> Isan Fulia.
>

