Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Mike
Hi All,

I have the same issue. I have installed a Solr instance on Tomcat 6. When I try
to index a PDF I run into the exception below:

11 Apr, 2011 12:11:55 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NoClassDefFoundError:
org/apache/tika/exception/TikaException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at
org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ClassNotFoundException:
org.apache.tika.exception.TikaException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 22 more

I could not find any Tika jar file.
Could you please help me fix the above issue?

Thanks,
Mike

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805615.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 3.1 performance compared to 1.4.1

2011-04-11 Thread Marius van Zwijndregt
Hi Yonik !

Thanks for your reply.

I decided to switch to 3.1 and see if the performance would settle down
after building up a proper index. Looking at the average response time from
both installations, I can see that 3.1 is now actually performing much better
than 1.4.1 (1.4.1 shows an average of 43ms, 3.1 shows 32ms).

My earlier test (with new keywords) now shows that 3.1 also outperforms
1.4.1 with keywords which have not yet been queried.

For the record, the tests were run on Ubuntu 10.04 (8GB RAM, quad core,
software RAID 1). I've given both installations a JVM with 1GB of RAM. I've
unpacked a new installation of 3.1 beside 1.4.1 and copied in the (in my
case) missing parts of the configuration (dataimporter, SQL XML config and
schema additions).

Cheers !

Marius

2011/4/10 Yonik Seeley 

> On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt
>  wrote:
> > Hello !
> >
> > I'm new to the list, have been using SOLR for roughly 6 months and love
> it.
> >
> > Currently i'm setting up a 3.1 installation, next to a 1.4.1 installation
> > (Ubuntu server, same JVM params). I have copied the configuration from
> 1.4.1
> > to the 3.1.
> > Both version are running fine, but one thing ive noticed, is that the
> QTime
> > on 3.1, is much slower for initial searches than on the (currently
> > production) 1.4.1 installation.
> >
> > For example:
> >
> > Searching with 3.1; http://mysite:9983/solr/select?q=grasmaaier: QTime
> > returns 371
> > Searching with 1.4.1: http://mysite:8983/solr/select?q=grasmaaier: QTime
> > returns 59
> >
> > Using debugQuery=true, i can see that the main time is spend in the query
> > component itself (org.apache.solr.handler.component.QueryComponent).
> >
> > Can someone explain this, and how can i analyze this further ? Does it
> take
> > time to build up a decent query, so could i switch to 3.1 without having
> to
> > worry ?
>
> Thanks for the report... there's no reason that anything should really
> be much slower, so it would be great to get to the bottom of this!
>
> Is this using the same index as the 1.4.1 server, or did you rebuild it?
>
> Are there any other query parameters (that are perhaps added by
> default, like faceting or anything else that could take up time) or is
> this truly just a term query?
>
> What platform are you on?  I believe the Lucene Directory
> implementation now tries to be smarter (compared to lucene 2.9) about
> picking the best default (but it may not be working out for you for
> some reason).
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Re: How to index PDF file stored in SQL Server 2008

2011-04-11 Thread Roy Liu
I changed data-config-sql.xml to

  

  





  



There are no errors, but the indexed PDF content is converted to numbers:
200 1 202 1 203 1 212 1 222 1 236 1 242 1 244 1 254 1 255
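
Is something like the following the right direction? A rough sketch (table and
column names are from my config; I'm assuming FieldStreamDataSource is
available in the 3.1 DIH jars):

  <dataConfig>
    <dataSource name="bsds"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
                user="username" password="pw"/>
    <!-- streams the binary column to the Tika sub-entity instead of indexing raw bytes -->
    <dataSource name="fieldds" type="FieldStreamDataSource"/>
    <document>
      <entity name="doc" dataSource="bsds"
              query="select id, attachment, filename from attachment where ext='pdf'">
        <field column="id" name="id"/>
        <field column="filename" name="filename"/>
        <entity name="pdf" processor="TikaEntityProcessor"
                dataSource="fieldds" dataField="doc.attachment" format="text">
          <field column="text" name="text"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

The point being that the outer entity no longer maps the attachment column to a
field directly; only the text extracted by the Tika sub-entity is indexed.
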
-- 
Best Regards,
Roy Liu


On Mon, Apr 11, 2011 at 2:02 PM, Roy Liu  wrote:

> Hi, all
> Thank YOU very much for your kindly help.
>
> *1. I have upgrade from Solr 1.4 to Solr 3.1*
> *2. Change data-config-sql.xml *
>
> 
>  name="*bsds*"
>   driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>
> url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
>   user="username"
>   password="pw"/>
>   
>
>   
>  query="select id,attachment,filename from attachment where
> ext='pdf' and id>30001030" >
>
> 
> * url="${doc.attachment}" format="text" >**
> 
> *
> 
> 
>   
> 
>
> *3. solrconfig.xml and schema.xml are NOT changed.*
>
> However, when I access
>
> *http://localhost:8080/solr/dataimport?command=full-import*
>
> It still has errors:
> Full Import
> failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Unable to execute query:[B@ae1393 Processing Document # 1
>
> Could you give me some advices. This problem is so boring me.
> Thanks.
>
> --
> Best Regards,
> Roy Liu
>
>
>
> On Mon, Apr 11, 2011 at 5:16 AM, Lance Norskog  wrote:
>
>> You have to upgrade completely to the Apache Solr 3.1 release. It is
>> worth the effort. You cannot copy any jars between Solr releases.
>> Also, you cannot copy over jars from newer Tika releases.
>>
>> On Fri, Apr 8, 2011 at 10:47 AM, Darx Oman  wrote:
>> > Hi again
>> > what you are missing is field mapping
>> > 
>> > 
>> >
>> >
>> > no need for TikaEntityProcessor  since you are not accessing pdf files
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
>


Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Roy Liu
\apache-solr-3.1.0\contrib\extraction\lib\tika*.jar
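
If the jars are not being picked up, something like these lib directives in
solrconfig.xml should load them (paths are relative to the core's instanceDir
and assume the stock 3.1 distribution layout; adjust to where you unpacked it):

  <lib dir="../../contrib/extraction/lib" />
  <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />

Copying those jars into the webapp's WEB-INF/lib (or a lib/ directory under the
Solr home) should also work.
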

-- 
Best Regards,
Roy Liu


On Mon, Apr 11, 2011 at 3:10 PM, Mike  wrote:

> Hi All,
>
> I have the same issue. I have installed solr instance on tomcat6. When try
> to index pdf I am running into the below exception:
>
> 11 Apr, 2011 12:11:55 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.NoClassDefFoundError:
> org/apache/tika/exception/TikaException
>at java.lang.Class.forName0(Native Method)
>at java.lang.Class.forName(Class.java:247)
>at
>
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
>at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
>at
> org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
>at
>
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
>at
>
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>at
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>at
>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>at
>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>at
>
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>at
>
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>at
>
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>at
>
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>at
>
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>at
>
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.tika.exception.TikaException
>at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
>... 22 more
>
> I could not found any tika jar file.
> Could you please help me out in fixing the above issue.
>
> Thanks,
> Mike
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805615.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread karsten-solr
Hi Lance,

you are right:
XPathEntityProcessor has the attribute "xsl", so I can use XSLT to generate an
XML file "in the form of the standard Solr update schema".
I will check the performance of this.
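
Something like this fragment, I think (file URL and stylesheet name are
placeholders):

  <entity name="pages" processor="XPathEntityProcessor"
          url="file:///data/docs.xml"
          xsl="xslt/to-solr-add.xsl"
          useSolrAddSchema="true"/>

With useSolrAddSchema="true" the fields come straight from the <add><doc>
elements the stylesheet emits, so no forEach/xpath mappings are needed.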


Best regards
  Karsten


btw. "flatten" is an attribute of the "field"-Tag, not of XPathEntityProcessor 
(like wrongly specified it the wiki)


 Lance
> There is an option somewhere to use the full XML DOM implementation
> for using xpaths. The purpose of the XPathEP is to be as simple and
> dumb as possible and handle most cases: RSS feeds and other open
> standards.
> 
> Search for xsl(optional)
> 
> http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
> 
 Karsten
> On Sat, Apr 9, 2011 at 5:32 AM
> > Hi Folks,
> >
> > does anyone improve DIH XPathRecordReader to deal with nested xpaths?
> > e.g.
> > data-config.xml with
> >   >  
> >  
> > and the XML stream contains
> >  /html/body/h1...
> > will only fill field “alltext” but field “title” will be empty.
> >
> > This is a known issue from 2009
> >
> https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
> >
> > So three questions:
> > 1. How to fill a “search over all”-Field without nested xpaths?
> >   (schema.xml   will not help,
> because we lose the original token order)
> > 2. Does anyone try to improve XPathRecordReader to deal with nested
> xpaths?
> > 3. Does anyone else need this feature?
> >
> >
> > Best regards
> >  Karsten
> >
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html


RE: Solr under Tomcat

2011-04-11 Thread Mike
Hi All,

I have installed a Solr instance on Tomcat 6. When I tried to index a PDF
file I was able to see this response:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">479</int>
  </lst>
</response>


Query:
http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

But when I tried to search the content in the PDF I could not get any
results:



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">struts</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>




 
Could you please let me know if I am doing anything wrong. It works fine
when I tried it with the default Jetty server before integrating with Tomcat 6.

I have followed installation steps from
http://wiki.apache.org/solr/SolrTomcat
(Tomcat on Windows Single Solr app).

Thanks,
Mike



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-under-Tomcat-tp2613501p2805970.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Mike
Hi Roy,

Thank you for the quick reply. When I tried to index the PDF file I was able
to see this response:


<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">479</int>
  </lst>
</response>



Query:
http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

But when I tried to search the content in the PDF I could not get any
results:



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">struts</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>




 
Could you please let me know if I am doing anything wrong. It works fine
when I tried it with the default Jetty server before integrating with Tomcat 6.

I have followed installation steps from
http://wiki.apache.org/solr/SolrTomcat
(Tomcat on Windows Single Solr app).

Thanks,
Mike



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805974.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Gary Taylor

Jayendra,

Thanks for the info - been keeping an eye on this list in case this 
topic cropped up again.  It's currently a background task for me, so 
I'll try and take a look at the patches and re-test soon.


Joey - glad you brought this issue up again.  I haven't progressed any 
further with it.  I've not yet moved to Solr 3.1 but it's on my to-do 
list, as is testing out the patches referenced by Jayendra.  I'll post 
my findings on this thread - if you manage to test the patches before 
me, let me know how you get on.


Thanks and kind regards,
Gary.


On 11/04/2011 05:02, Jayendra Patil wrote:

The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches. (Solr
Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel  wrote:

Hi Gary,

I have been experiencing the same problem... Unable to extract content from
archive file formats.  I just tried again with a clean install of Solr 3.1.0
(using Tika 0.8) and continue to experience the same results.  Did you have
any success with this problem with Solr 1.4.1 or 3.1.0 ?

I'm using this curl command to send data to Solr.
curl "
http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true";
-H "application/octet-stream" -F  "myfile=@data.zip"

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first message,
some people have been able to get this functionality to work as desired.






--
Gary Taylor
INOVEM

Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE



Spellchecker with synonyms

2011-04-11 Thread royr
Hello,

I have some synonyms for city names. Sometimes there are multiple names for
one city, for example:

newyork, newyork city, big apple

I search for "big apple" and get results for new york (synonym).
If somebody searches for "big aple" I want a spelling suggestion like "big
apple". How can I make sure the synonyms
are available to the spellchecker?









--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecker-with-synonyms-tp2806028p2806028.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellchecker with synonyms

2011-04-11 Thread lboutros
Did you configure synonyms for your field at query time ?

Ludovic.

2011/4/11 royr [via Lucene] 

> Hello,
>
> I have some synonyms for city names. Sometimes there are multiple names for
> one city, example:.
>
> newyork, newyork city, big apple
>
> I search for "big apple" and get results with new york(synonym)
> If somebody search for "big aple" i want a spelling suggestion like: big
> apple. How can i fix that synonyms
> are available for the spellchecker?
>
>
>
>
>
>
>
>
>
>


-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecker-with-synonyms-tp2806028p2806113.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Spellchecker with synonyms

2011-04-11 Thread royr
Yes, it looks like this:


  
   
   
   
   
   
  


It should work at query and index time, I think.
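
Roughly along these lines (the exact tokenizer and other filters may differ):

  <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

As far as I know, the spellchecker builds its dictionary from the indexed terms
of whatever field it is configured against, so suggestions will only include
synonyms if that field has them applied at index time.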

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellchecker-with-synonyms-tp2806028p2806157.html
Sent from the Solr - User mailing list archive at Nabble.com.


XML not coming through from nabble to Gmail

2011-04-11 Thread Erick Erickson
All:

Lately I've been seeing a lot of posts where people paste in parts of their
schema.xml or solrconfig.xml and the results are...er...disappointing. None
of the less-than or greater-than symbols show and the formatting is all over
the map.

Since some mails would come through with the XML formatted and some would be
wonky, at first I thought it was the sender, but then a pretty high
percentage came through this way. So I poked around and it seems to only be
the case that the XML is "wonkified" (tm) when it comes to Gmail from
Nabble; the original post on Nabble has the markup and displays fine.
Behavior is the same in Chrome and Firefox, BTW.

Does anyone have any insight into this? Time to complain to the nabble
folks? Do others see this with non-Gmail relays?

Thanks,
Erick


Can I set up a config-based distributed search

2011-04-11 Thread Ran Peled
In the Distributed Search page (
http://wiki.apache.org/solr/DistributedSearch), it is documented that in
order to perform a distributed search over a sharded index, I should use the
"shards" request parameter, listing the shards to participate in the search
(e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
new pretty large index (1B+ items).  Say I have 100 shards; specifying the
shards on the request URL becomes unrealistic due to the length of the URL.  It
is also redundant to do that on every request.

Is there a way to specify the list of shards in a configuration file,
instead of on the query URL?  I have seen references to relevant config in
SolrCloud, but as I understand it planned to be released only in Solr 4.0.

Thanks,
Ran


Re: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Michael McCandless
Tom,

I think I see where this may be -- it looks like another > 2B terms
bug in Lucene (we are using an int instead of a long in the
TermInfoAndOrd class inside TermInfosReader.java), only present in
3.1.

I'm also mad that Test2BTerms fails to catch this!!  I will go fix
that test and confirm it sees this bug.

Can you build from source?  If so, try this patch:

Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
===
--- lucene/src/java/org/apache/lucene/index/TermInfosReader.java  (revision 1089906)
+++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java  (working copy)
@@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
-final int termOrd;
-public TermInfoAndOrd(TermInfo ti, int termOrd) {
+final long termOrd;
+public TermInfoAndOrd(TermInfo ti, long termOrd) {
   super(ti);
   this.termOrd = termOrd;
 }
@@ -245,7 +245,7 @@
 // wipe out the cache when they iterate over a large numbers
 // of terms in order
 if (tiOrd == null) {
-  termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int) enumerator.position));
+  termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
 } else {
   assert sameTermInfo(ti, tiOrd, enumerator);
   assert (int) enumerator.position == tiOrd.termOrd;
@@ -262,7 +262,7 @@
 // random-access: must seek
 final int indexPos;
 if (tiOrd != null) {
-  indexPos = tiOrd.termOrd / totalIndexInterval;
+  indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
 } else {
   // Must do binary search:
   indexPos = getIndexOffset(term);
@@ -274,7 +274,7 @@
 if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0) {
   ti = enumerator.termInfo();
   if (tiOrd == null) {
-termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int) enumerator.position));
+termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
   } else {
 assert sameTermInfo(ti, tiOrd, enumerator);
 assert (int) enumerator.position == tiOrd.termOrd;

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom  wrote:
> The query below results in an array out of bounds exception:
> select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr
>
> Here is the exception:
>  Exception during facet.field of 
> topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
>        at 
> org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
>
> We are using a dev version of Solr/Lucene:
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Just before the exception we see this entry in our tomcat logs:
>
> Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
> INFO: UnInverted multi-valued field 
> {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
> Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute
>
> Is this a known bug?  Can anyone provide a clue as to how we can determine 
> what the problem is?
>
> Tom Burton-West
>
>
> Appended Below is the exception stack trace:
>
> SEVERE: Exception during facet.field of 
> topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
>        at 
> org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
>        at 
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:271)
>        at 
> org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:338)
>        at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:928)
>        at 
> org.apache.lucene.index.DirectoryReader$MultiTermEnum.(DirectoryReader.java:1055)
>        at 
> org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:659)
>        at 
> org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
>        at 
> org.apache.solr.request.NumberedTermEnum.skipTo(UnInvertedField.java:1018)
>        at 
> org.apache.solr.request.UnInvertedField.getTermText(UnInvertedField.java:838)
>        at 
> org.apache.solr.request.UnInvertedField.getCounts(UnInvertedField.java:617)
>        at 
> org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:279)
>        at 
> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:312)
>        at 
> org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:174)
>        at 
> org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
>        at 
> org.apache.solr.handler.component.SearchHa

Re: Can I set up a config-based distributed search

2011-04-11 Thread lboutros
You can add the "shards" parameter to your search handler defaults:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">host1/solr,host2/solr</str>
  </lst>
</requestHandler>

Is this what you are looking for?

Ludovic.

2011/4/11 Ran Peled [via Lucene] <
ml-node+2806331-346788257-383...@n3.nabble.com>

> In the Distributed Search page (
> http://wiki.apache.org/solr/DistributedSearch), it is documented that in
> order to perform a distributed search over a sharded index, I should use
> the
> "shards" request parameter, listing the shards to participate in the search
>
> (e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
> new pretty large index (1B+ items).  Say I have a 100 shards, specifying
> the
> shards on the request URL becomes unrealistic due to length of URL.  It is
> also redundant to do that on every request.
>
> Is there a way to specify the list of shards in a configuration file,
> instead of on the query URL?  I have seen references to relevant config in
> SolrCloud, but as I understand it planned to be released only in Solr 4.0.
>
> Thanks,
> Ran
>
>
>


-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-I-set-up-a-config-based-distributed-search-tp2806331p2806763.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Is there a way to create multiple <doc> using DIH and access the data pertaining to a particular <doc>?

2011-04-11 Thread Mike
Hi All,

I am new to Solr. I want to implement Solr search.

I have to implement two search buttons (1. books and 2. computers, both
in the same datasource) which are completely different; there is no
relation between them.
Could you please let me know how to define the entities in data-config.xml and
also in schema.xml?

Is it possible to do something like the sketch below?
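
For illustration, a rough sketch of what I have in mind (driver, connection URL
and table/column names are made up), with a doc_type field added via
TemplateTransformer so each search can filter on its own type of document:

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/store" user="user" password="pw"/>
    <document>
      <entity name="books" query="select id, title, author from books"
              transformer="TemplateTransformer">
        <field column="doc_type" template="book"/>
        <field column="id" name="id"/>
        <field column="title" name="title"/>
        <field column="author" name="author"/>
      </entity>
      <entity name="computers" query="select id, model, brand from computers"
              transformer="TemplateTransformer">
        <field column="doc_type" template="computer"/>
        <field column="id" name="id"/>
        <field column="model" name="model"/>
        <field column="brand" name="brand"/>
      </entity>
    </document>
  </dataConfig>

The corresponding fields (id, title, author, model, brand, doc_type) would then
be declared in schema.xml, and the two searches could filter with
fq=doc_type:book or fq=doc_type:computer.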

Thanks,
Mike
 
  


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-there-a-way-to-create-multiple-doc-using-DIH-and-access-the-data-pertaining-to-a-particular-doc-n-tp1877203p2806787.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing Best Practice

2011-04-11 Thread Shaun Campbell
If it's of any help I've split the processing of PDF files from the
indexing. I put the PDF content into a text file (but I guess you could load
it into a database) and use that as part of the indexing.  My processing of
the PDF files also compares timestamps on the document and the text file so
that I'm only processing documents that have changed.

I am a newbie so perhaps there's more sophisticated approaches.

Hope that helps.
Shaun

On 11 April 2011 07:20, Darx Oman  wrote:

> Hi guys
>
> I'm wondering how to best configure solr to fulfills my requirements.
>
> I'm indexing data from 2 data sources:
> 1- Database
> 2- PDF files (password encrypted)
>
> Every file has related information stored in the database.  Both the file
> content and the related database fields must be indexed as one document in
> solr.  Among the DB data is *per-user* permissions for every document.
>
> The file contents nearly never change, on the other hand, the DB data and
> especially the permissions change very frequently which require me to
> re-index everything for every modified document.
>
> My problem is in process of decrypting the PDF files before re-indexing
> them
> which takes too much time for a large number of documents, it could span to
> days in full re-indexing.
>
> What I'm trying to accomplish is eliminating the need to re-index the PDF
> content if not changed even if the DB data changed.  I know this is not
> possible in solr, because solr doesn't update documents.
>
> So how to best accomplish this:
>
> Can I use 2 indexes one for PDF contents and the other for DB data and have
> a common id field for both as a link between them, *and results are treated
> as one Document*?
>


Reloading synonyms.txt without downtime

2011-04-11 Thread Otis Gospodnetic
Hi,

Apparently, when one RELOADs a core, the synonyms file is not reloaded.  Is this
the expected behaviour?  Is it the desired behaviour?

Here's the use-case:
When one is doing purely query-time synonym expansion, ideally one would be able
to edit synonyms.txt and get it reloaded, so that the changes can start taking
effect immediately.

One might think that RELOADing a Solr core would achieve this, but apparently 
this doesn't happen.  Should it?
Are there technical reasons why RELOADing a core should not reload the synonyms 
file? (other than if synonyms are used at index-time, changing the synonyms 
would mean that one has to reindex old docs in order for changes to synonyms to 
apply to old docs).
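
For reference, the RELOAD call I mean (host, port and core name are
placeholders):

  curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0"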

Issue https://issues.apache.org/jira/browse/SOLR-1307 mentions this a bit, but 
doesn't go in a lot of depth.

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Can I set up a config-based distributed search

2011-04-11 Thread Jonathan Rochkind
I have not worked with shards/distributed, but I think you can probably 
specify them as defaults in your requesthandler in your solrconfig.xml 
instead.


Somewhere there is (or was) a wiki page on this I can't find right now. 
There's a way to specify (for a particular request handler) a default 
parameter value, such as for 'shards', that will be used if none were 
given with the request. There's also a way to specify an invariant that 
will always be used even if something else is passed in on the request.


Ah, found it: http://wiki.apache.org/solr/SearchHandler#Configuration

On 4/11/2011 8:31 AM, Ran Peled wrote:

In the Distributed Search page (
http://wiki.apache.org/solr/DistributedSearch), it is documented that in
order to perform a distributed search over a sharded index, I should use the
"shards" request parameter, listing the shards to participate in the search
(e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
new pretty large index (1B+ items).  Say I have a 100 shards, specifying the
shards on the request URL becomes unrealistic due to length of URL.  It is
also redundant to do that on every request.

Is there a way to specify the list of shards in a configuration file,
instead of on the query URL?  I have seen references to relevant config in
SolrCloud, but as I understand it planned to be released only in Solr 4.0.

Thanks,
Ran



Re: Performance with search terms starting and ending with wildcards

2011-04-11 Thread Otis Gospodnetic
Hi,

Perhaps you should give Lucene/Solr trunk a try and compare!  The Wildcard 
query 
in trunk should be much faster.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Ueland 
> To: solr-user@lucene.apache.org
> Sent: Sun, April 10, 2011 10:44:46 AM
> Subject: Performance with search terms starting and ending with wildcards
> 
> Hi!
> 
> I have been doing some testing with solr and wildcards. Queries  like:
> 
> - *foo
> - foo*
> 
> Does complete quickly(1-2s) in a test index  on about 40-50GB.
> 
> But when i try to do a search for *foo*, the search  time can without any
> trouble come upwards for 30seconds plus. 
> 
> Any  ideas on how that issue can be worked around? 
> 
> One fix would be to change  *foo* to (*foo or foo* or oof* or *oof) (is the
> reverse even needed?). But  that will not give the same results as *foo*,
> logicly enough.
> 
> I have  also tried to set maxTimeAllowed, but that is simply ignored. I guess
> that is  related to either sorting or the wildcard search itself. 
> 
> --
> View this  message in context: 
>http://lucene.472066.n3.nabble.com/Performance-with-search-terms-starting-and-ending-with-wildcards-tp2802561p2802561.html
>
> Sent  from the Solr - User mailing list archive at Nabble.com.
> 


Clarifying "fetchindex" command

2011-04-11 Thread Otis Gospodnetic
Hi,

Can one actually *force* replication of the index from the master without a 
commit being issued on the master since the last replication?

I do see "Force a fetchindex on slave from master command: 
http://slave_host:port/solr/replication?command=fetchindex"; on 
http://wiki.apache.org/solr/SolrReplication#HTTP_API, but that feels more like 
"force the replication *now* instead of waiting for the slave to poll the 
master" than "force the replication even if there is no new commit point and no 
new index version on the master".  Which one is it, really?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
Thanks Mike,

At first I thought this couldn't be related to the 2.1 Billion terms issue 
since the only place we have tons of terms is in the OCR field and this is not 
the OCR field. But then I remembered that the total number of terms in all 
fields is what matters. We've had no problems with regular searches against the 
index or with other facet queries.  Only with this facet.   Is TermInfoAndOrd 
only used for faceting?

I'll go ahead and build the patch and let you know.


Tom

p.s. Here is the field definition:




-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, April 11, 2011 8:40 AM
To: solr-user@lucene.apache.org
Cc: Burton-West, Tom
Subject: Re: ArrayIndexOutOfBoundsException with facet query

Tom,

I think I see where this may be -- it looks like another > 2B terms
bug in Lucene (we are using an int instead of a long in the
TermInfoAndOrd class inside TermInfosReader.java), only present in
3.1.

I'm also mad that Test2BTerms fails to catch this!!  I will go fix
that test and confirm it sees this bug.

Can you build from source?  If so, try this patch:

Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
===
--- lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(revision
1089906)
+++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java
(working copy)
@@ -46,8 +46,8 @@

   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
-final int termOrd;
-public TermInfoAndOrd(TermInfo ti, int termOrd) {
+final long termOrd;
+public TermInfoAndOrd(TermInfo ti, long termOrd) {
   super(ti);
   this.termOrd = termOrd;
 }
@@ -245,7 +245,7 @@
 // wipe out the cache when they iterate over a large numbers
 // of terms in order
 if (tiOrd == null) {
-  termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+  termsCache.put(cacheKey, new TermInfoAndOrd(ti,
enumerator.position));
 } else {
   assert sameTermInfo(ti, tiOrd, enumerator);
   assert (int) enumerator.position == tiOrd.termOrd;
@@ -262,7 +262,7 @@
 // random-access: must seek
 final int indexPos;
 if (tiOrd != null) {
-  indexPos = tiOrd.termOrd / totalIndexInterval;
+  indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
 } else {
   // Must do binary search:
   indexPos = getIndexOffset(term);
@@ -274,7 +274,7 @@
 if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0) {
   ti = enumerator.termInfo();
   if (tiOrd == null) {
-termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
enumerator.position));
+termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
   } else {
 assert sameTermInfo(ti, tiOrd, enumerator);
 assert (int) enumerator.position == tiOrd.termOrd;

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom  wrote:
> The query below results in an array out of bounds exception:
> select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr
>
> Here is the exception:
>  Exception during facet.field of 
> topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
>        at 
> org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
>
> We are using a dev version of Solr/Lucene:
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Just before the exception we see this entry in our tomcat logs:
>
> Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
> INFO: UnInverted multi-valued field 
> {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
> Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute
>
> Is this a known bug?  Can anyone provide a clue as to how we can determine 
> what the problem is?
>
> Tom Burton-West
>
>
> Appended Below is the exception stack trace:
>
> SEVERE: Exception during facet.field of 
> topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
>        at 
> org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
>        at 
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:271)
>        at 
> org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:338)
>        at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:928)
>        at 
> org.apache.lucene.index.DirectoryReader$MultiTermEnum.(DirectoryReader.java:1055)
>        at 
> org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:659)
>        at 
> org

RE: Problems indexing very large set of documents

2011-04-11 Thread Brandon Waterloo
I found a simpler command-line method to update the PDF files.  On some 
documents it works perfectly: the result is a pixel-for-pixel match and none of
the OCR text (which is what all these PDFs are, newspaper articles that have 
been passed through OCR) is lost.  However, on other documents the result is 
considerably blurrier and some of the OCR text is lost.

We've decided to skip any documents that Tika cannot index for now.

As Lance stated, it's not specifically the version that causes the problem but 
rather some quirks caused by different PDF writers; a few tests have confirmed
this, so we can't use version to determine which should be skipped.  I'm
examining the XML responses from the queries, and I cannot figure out how to 
tell from the XML response whether or not a document was successfully indexed.  
The status value seems to be 0 regardless of whether indexing was successful or 
not.

So my question is, how can I tell from the response whether or not indexing was 
actually successful?

~Brandon Waterloo


From: Lance Norskog [goks...@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText. It parses and writes PDFs very very
well, and a simple program will let you do a batch conversion.  PDFs
are made by a wide range of programs, not just Adobe code. Many of
these do weird things and make small mistakes that Tika does not know
to handle. In other words there is "dirty PDF" just like "dirty HTML".

A percentage of PDFs will fail and that's life. One site that gets
press releases from zillions of sites (and thus a wide range of PDF
generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo
 wrote:
> I think I've finally found the problem.  The files that work are PDF version 
> 1.6.  The files that do NOT work are PDF version 1.4.  I'll look into 
> updating all the old documents to PDF 1.6.
>
> Thanks everyone!
>
> ~Brandon Waterloo
> 
> From: Ezequiel Calderara [ezech...@gmail.com]
> Sent: Friday, April 08, 2011 11:35 AM
> To: solr-user@lucene.apache.org
> Cc: Brandon Waterloo
> Subject: Re: Problems indexing very large set of documents
>
> Maybe those files are created with a different Adobe Format version...
>
> See this: 
> http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html
>
> On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo 
> mailto:brandon.water...@matrix.msu.edu>> 
> wrote:
> A second test has revealed that it is something to do with the contents, and 
> not the literal filenames, of the second set of files.  I renamed one of the 
> second-format files and tested it and Solr still failed.  However, the 
> problem still only applies to those files of the second naming format.
> 
> From: Brandon Waterloo 
> [brandon.water...@matrix.msu.edu]
> Sent: Friday, April 08, 2011 10:40 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> I had some time to do some research into the problems.  From what I can tell, 
> it appears Solr is tripping up over the filename.  These are strictly 
> examples, but, Solr handles this filename fine:
>
> 32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf
>
> However, it fails with either a parsing error or an EOF exception on this 
> filename:
>
> 32-130-A08-84-al.sff.document.nusa197102.pdf
>
> The only significant difference is that the second filename contains multiple 
> periods.  As there are about 1700 files whose filenames are similar to the 
> second format it is simply not possible to change their filenames.  In 
> addition they are being used by other applications.
>
> Is there something I can change in Solr configs to fix this issue or am I 
> simply SOL until the Solr dev team can work on this? (assuming I put in a 
> ticket)
>
> Thanks again everyone,
>
> ~Brandon Waterloo
>
>
> 
> From: Chris Hostetter 
> [hossman_luc...@fucit.org]
> Sent: Tuesday, April 05, 2011 3:03 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Problems indexing very large set of documents
>
> : It wasn't just a single file, it was dozens of files all having problems
> : toward the end just before I killed the process.
>   ...
> : That is by no means all the errors, that is just a sample of a few.
> : You can see they all threw HTTP 500 errors.  What is strange is, nearly
> : every file succeeded before about the 2200-files-mark, and nearly every
> : file after that failed.
>
> ..the root question is: do those files *only* fail if you have already
> indexed ~2200 files, or do they fail if you start up your server and index
> them first?
>
> t

Re: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Michael McCandless
Right, it's the total number of terms across all fields... unfortunately.

This class is used to enroll a term into the terms cache that wraps
the terms dictionary, so in theory you could also hit this issue
during normal searching when a term is looked up once,  and then
looked up again (the 2nd time will pull from the cache).

I've mod'd Test2BTerms and am running it now...

Mike

http://blog.mikemccandless.com

On Mon, Apr 11, 2011 at 12:51 PM, Burton-West, Tom  wrote:
> Thanks Mike,
>
> At first I thought this couldn't be related to the 2.1 Billion terms issue 
> since the only place we have tons of terms is in the OCR field and this is 
> not the OCR field. But then I remembered that the total number of terms in 
> all fields is what matters. We've had no problems with regular searches 
> against the index or with other facet queries.  Only with this facet.   Is 
> TermInfoAndOrd only used for faceting?
>
> I'll go ahead and build the patch and let you know.
>
>
> Tom
>
> p.s. Here is the field definition:
>  multiValued="true"/>
>  omitNorms="true"/>
>
>
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, April 11, 2011 8:40 AM
> To: solr-user@lucene.apache.org
> Cc: Burton-West, Tom
> Subject: Re: ArrayIndexOutOfBoundsException with facet query
>
> Tom,
>
> I think I see where this may be -- it looks like another > 2B terms
> bug in Lucene (we are using an int instead of a long in the
> TermInfoAndOrd class inside TermInfosReader.java), only present in
> 3.1.
>
> I'm also mad that Test2BTerms fails to catch this!!  I will go fix
> that test and confirm it sees this bug.
>
> Can you build from source?  If so, try this patch:
>
> Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
> ===
> --- lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
> (revision
> 1089906)
> +++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
> (working copy)
> @@ -46,8 +46,8 @@
>
>   // Just adds term's ord to TermInfo
>   private final static class TermInfoAndOrd extends TermInfo {
> -    final int termOrd;
> -    public TermInfoAndOrd(TermInfo ti, int termOrd) {
> +    final long termOrd;
> +    public TermInfoAndOrd(TermInfo ti, long termOrd) {
>       super(ti);
>       this.termOrd = termOrd;
>     }
> @@ -245,7 +245,7 @@
>             // wipe out the cache when they iterate over a large numbers
>             // of terms in order
>             if (tiOrd == null) {
> -              termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
> enumerator.position));
> +              termsCache.put(cacheKey, new TermInfoAndOrd(ti,
> enumerator.position));
>             } else {
>               assert sameTermInfo(ti, tiOrd, enumerator);
>               assert (int) enumerator.position == tiOrd.termOrd;
> @@ -262,7 +262,7 @@
>     // random-access: must seek
>     final int indexPos;
>     if (tiOrd != null) {
> -      indexPos = tiOrd.termOrd / totalIndexInterval;
> +      indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
>     } else {
>       // Must do binary search:
>       indexPos = getIndexOffset(term);
> @@ -274,7 +274,7 @@
>     if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0) {
>       ti = enumerator.termInfo();
>       if (tiOrd == null) {
> -        termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
> enumerator.position));
> +        termsCache.put(cacheKey, new TermInfoAndOrd(ti, 
> enumerator.position));
>       } else {
>         assert sameTermInfo(ti, tiOrd, enumerator);
>         assert (int) enumerator.position == tiOrd.termOrd;
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom  wrote:
>> The query below results in an array out of bounds exception:
>> select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr
>>
>> Here is the exception:
>>  Exception during facet.field of 
>> topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
>>        at 
>> org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
>>
>> We are using a dev version of Solr/Lucene:
>>
>> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
>> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 
>> 16:00:54
>> Lucene Specification Version: 3.1-SNAPSHOT
>> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>>
>> Just before the exception we see this entry in our tomcat logs:
>>
>> Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
>> INFO: UnInverted multi-valued field 
>> {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
>> Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute
>>
>> Is this a known bug?  Can anyone provide a clue as to how we can determine 
>> what the problem is?
>>
>> Tom Burton-West
>>
>>
>

RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
Thanks Mike,

With the unpatched version, the first time I run the facet query on topicStr it 
works fine, but the second time I get the ArrayIndexOutOfBoundsException.   If 
I try different facets such as language, I don't see the same symptoms.  Maybe 
the number of facet values needs to exceed some number to trigger the bug?

I rebuilt lucene-core-3.1-SNAPSHOT.jar  with your patch and it fixes the 
problem. 


Tom

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Monday, April 11, 2011 1:00 PM
To: Burton-West, Tom
Cc: solr-user@lucene.apache.org
Subject: Re: ArrayIndexOutOfBoundsException with facet query

Right, it's the total number of terms across all fields... unfortunately.

This class is used to enroll a term into the terms cache that wraps
the terms dictionary, so in theory you could also hit this issue
during normal searching when a term is looked up once,  and then
looked up again (the 2nd time will pull from the cache).

I've mod'd Test2BTerms and am running it now...

Mike

http://blog.mikemccandless.com

On Mon, Apr 11, 2011 at 12:51 PM, Burton-West, Tom  wrote:
> Thanks Mike,
>
> At first I thought this couldn't be related to the 2.1 Billion terms issue 
> since the only place we have tons of terms is in the OCR field and this is 
> not the OCR field. But then I remembered that the total number of terms in 
> all fields is what matters. We've had no problems with regular searches 
> against the index or with other facet queries.  Only with this facet.   Is 
> TermInfoAndOrd only used for faceting?
>
> I'll go ahead and build the patch and let you know.
>
>
> Tom
>
> p.s. Here is the field definition:
>  multiValued="true"/>
>  omitNorms="true"/>
>
>
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, April 11, 2011 8:40 AM
> To: solr-user@lucene.apache.org
> Cc: Burton-West, Tom
> Subject: Re: ArrayIndexOutOfBoundsException with facet query
>
> Tom,
>
> I think I see where this may be -- it looks like another > 2B terms
> bug in Lucene (we are using an int instead of a long in the
> TermInfoAndOrd class inside TermInfosReader.java), only present in
> 3.1.
>
> I'm also mad that Test2BTerms fails to catch this!!  I will go fix
> that test and confirm it sees this bug.
>
> Can you build from source?  If so, try this patch:
>
> Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
> ===
> --- lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
> (revision
> 1089906)
> +++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java        
> (working copy)
> @@ -46,8 +46,8 @@
>
>   // Just adds term's ord to TermInfo
>   private final static class TermInfoAndOrd extends TermInfo {
> -    final int termOrd;
> -    public TermInfoAndOrd(TermInfo ti, int termOrd) {
> +    final long termOrd;
> +    public TermInfoAndOrd(TermInfo ti, long termOrd) {
>       super(ti);
>       this.termOrd = termOrd;
>     }
> @@ -245,7 +245,7 @@
>             // wipe out the cache when they iterate over a large numbers
>             // of terms in order
>             if (tiOrd == null) {
> -              termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
> enumerator.position));
> +              termsCache.put(cacheKey, new TermInfoAndOrd(ti,
> enumerator.position));
>             } else {
>               assert sameTermInfo(ti, tiOrd, enumerator);
>               assert (int) enumerator.position == tiOrd.termOrd;
> @@ -262,7 +262,7 @@
>     // random-access: must seek
>     final int indexPos;
>     if (tiOrd != null) {
> -      indexPos = tiOrd.termOrd / totalIndexInterval;
> +      indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
>     } else {
>       // Must do binary search:
>       indexPos = getIndexOffset(term);
> @@ -274,7 +274,7 @@
>     if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0) {
>       ti = enumerator.termInfo();
>       if (tiOrd == null) {
> -        termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int)
> enumerator.position));
> +        termsCache.put(cacheKey, new TermInfoAndOrd(ti, 
> enumerator.position));
>       } else {
>         assert sameTermInfo(ti, tiOrd, enumerator);
>         assert (int) enumerator.position == tiOrd.termOrd;
>
> Mike
>
> http://blog.mikemccandless.com
>
> On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom  wrote:
>> The query below results in an array out of bounds exception:
>> select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr
>>
>> Here is the exception:
>>  Exception during facet.field of 
>> topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
>>        at 
>> org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
>>
>> We are using a dev version of Solr/Lucene:
>>
>> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
>> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 20

Lucene Revolution 2011 - Early Bird Ends April 18

2011-04-11 Thread Michael Bohlig

A quick reminder that there's one week left on special pricing for Lucene 
Revolution 2011. Sign up this week and save some serious cash:

- Conference Registration, now $545, a savings of $180 over the $725 late 
registration price
- Training Package with 2-day Training plus Conference Registration now 
$1695, a savings of $200 over the 
  $1895 late registration package price (and even more savings over the a 
la carte pricing)

What can you expect at the conference?

- Keynote presentations from The Guardian News and Media’s Stephen Dunn and 
Redmonk’s Stephen O’Grady
- Session track talks on use cases, tutorials and technology strategy at 
leading edge, innovative
  companies, including: Travelocity, eBay, eHarmony, EMC, Etsy, Trulia, 
Intuit, Careerbuilder, AT&T, The
  Ladders and more
- Deep internals and implementation guidance at talks by Apache Solr/Lucene 
committers including Grant 
  Ingersoll, Yonik Seeley, Andrzej Bialecki, Uwe Schindler, Simon 
Willnauer, Erik Hatcher, Otis 
  Gospodnetic, and more.

You will also have an unmatched opportunity to network with over 400 of your 
peers from the open source search ecosystem, in all sectors of government, 
universities, start-ups, Fortune 1000 companies, and the developer and user 
community. 

Register at: http://us.ootoweb.com/luceneregistration

P.S. There are also a few free tickets left for the San Francisco Giants vs. 
Florida Marlins game on May 24!


Michael Bohlig | Lucid Imagination 
Enterprise Marketing 
p +1 650 353 4057 x132 
m+1 650 703 8383 
www.lucidimagination.com 





Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread karsten-solr
Hi Lance,

I used XPathEntityProcessor with the attribute "xsl" and generated an XML file
"in the form of the standard Solr update schema".
I lost a lot of performance; it is a pity that XPathEntityProcessor only uses
one thread.

My tests with a collection of 350T documents:
1. use of XPathRecordReader without xslt: 28min
2. use of XPathEntityProcessor with xslt (standard solr-war / Xalan): 44min
3. use of XPathEntityProcessor with saxon XSLT: 36min


Best regards
  Karsten



 Lance 
> There is an option somewhere to use the full XML DOM implementation
> for using xpaths. The purpose of the XPathEP is to be as simple and
> dumb as possible and handle most cases: RSS feeds and other open
> standards.
> 
> Search for xsl(optional)
> 
> http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
> 
--karsten
> > Hi Folks,
> >
> > does anyone improve DIH XPathRecordReader to deal with nested xpaths?
> > e.g.
> > data-config.xml with
> >   >  
> >  
> > and the XML stream contains
> >  /html/body/h1...
> > will only fill field “alltext” but field “title” will be empty.
> >
> > This is a known issue from 2009
> >
> https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
> >
> > So three questions:
> > 1. How to fill a “search over all”-Field without nested xpaths?
> >   (schema.xml   will not help,
> because we lose the original token order)
> > 2. Does anyone try to improve XPathRecordReader to deal with nested
> xpaths?
> > 3. Does anyone else need this feature?
> >
> >
> > Best regards
> >  Karsten
> >

http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html


Will Slaves Pileup Replication Requests?

2011-04-11 Thread Parker Johnson

What is the slave replication behavior if a replication request to pull
indexes takes longer than the replication interval itself?

In other words, if my replication interval is set to be every 30 seconds,
and my indexes are significantly large enough to take longer than 30
seconds to transfer, is the slave smart enough to not send another
replication request if one is already in progress?


-Parker




Re: Will Slaves Pileup Replication Requests?

2011-04-11 Thread Green, Larry (CMG - Digital)
Yes. It will wait whatever the replication interval is after the most recent 
replication completes before attempting again.
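
(For context, the interval in question is the slave-side pollInterval in solrconfig.xml.
A minimal sketch, with the master URL as a placeholder and a 30-second interval to match
the example above:)

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- placeholder master URL; pollInterval is in HH:mm:ss -->
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:30</str>
  </lst>
</requestHandler>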

On Apr 11, 2011, at 2:42 PM, Parker Johnson wrote:

> 
> What is the slave replication behavior if a replication request to pull
> indexes takes longer than the replication interval itself?
> 
> Anotherwords, if my replication interval is set to be every 30 seconds,
> and my indexes are significantly large enough to take longer than 30
> seconds to transfer, is the slave smart enough to not send another
> replication request if one is already in progress?
> 
> 
> -Parker
> 
> 



Re: Exact match on a field with stemming

2011-04-11 Thread Otis Gospodnetic
Hi,

Using quotes means "use this as a phrase", not "use this as a literal". :)
I think copying to an unstemmed field is the only/common work-around.
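
(A minimal schema.xml sketch of that work-around; the field and type names are made up
for illustration, with "text" assumed stemmed and "text_exact" unstemmed:)

<field name="body"       type="text"       indexed="true" stored="true"/>
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>

Exact-match queries then go against body_exact, at the cost of the extra indexed field.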

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Pierre-Luc Thibeault 
> To: solr-user@lucene.apache.org
> Sent: Mon, April 11, 2011 2:55:04 PM
> Subject: Exact match on a field with stemming
> 
> Hi all,
> 
> Is there a way to perform an exact match query on a field that  has stemming 
>enable by using the standard /select handler?
> 
> I thought  that putting word inside double-quotes would enable this behaviour 
>but if I  query my field with a single word like “manager”
> I am receiving results  containing the word “management”
> 
> I know I can use a CopyField with  different types but that would double the 
>size of my index… Is there an  alternative?
> 
> Thanks
>


Re: Will Slaves Pileup Replication Requests?

2011-04-11 Thread Parker Johnson

Thanks Larry.

-Parker

On 4/11/11 12:14 PM, "Green, Larry (CMG - Digital)"
 wrote:

>Yes. It will wait whatever the replication interval is after the most
>recent replication completes before attempting again.
>
>On Apr 11, 2011, at 2:42 PM, Parker Johnson wrote:
>
>> 
>> What is the slave replication behavior if a replication request to pull
>> indexes takes longer than the replication interval itself?
>> 
>> Anotherwords, if my replication interval is set to be every 30 seconds,
>> and my indexes are significantly large enough to take longer than 30
>> seconds to transfer, is the slave smart enough to not send another
>> replication request if one is already in progress?
>> 
>> 
>> -Parker
>> 
>> 
>
>




Question on Dismax plugin

2011-04-11 Thread Nemani, Raj
All,

I have a question on the Dismax plugin for the search handler.  I have
two test instances of Solr.  In one I am using the default search
handler.  In this case, the fields that I am working with (slug and
story) are indexed via the all_text field and the searches are done on
the all_text field.

For the other one I have configured a search handler using the dismax
plugin as shown below.

 





 dismax

 explicit

 0.01

 

story^3.0 slug^0.2

 

 100

 *:*

 

  

 

To make testing easier, I only have 4 (same) documents in both indexes
with the word "Obama" appearing inside as described below.

 

File 1:: The word Obama appears zero times in "slug" field and four
times in "story" field

File 2:: The word Obama appears zero times in "slug" field and thrice in
"story" field

File 3:: The word Obama appears zero times in "slug" field and two times
in "story" field

File 4:: The word Obama appears One time in "slug" field and one time in
"story" field

 

 

Here are the documents in order of decreasing score
from the search results

 

Dismax Search Handler (steadily decreasing scores):

* File 1:: The word Obama appears zero times in "slug" field and
four times in "story" field

* File 4:: The word Obama appears One time in "slug" field and
one time in "story" field

* File 2:: The word Obama appears zero times in "slug" field and
thrice in "story" field

* File 3:: The word Obama appears zero times in "slug" field and
two times in "story" field

 

Standard Search handler:

* File 1:: The word Obama appears zero times in "slug" field and
four times in "story" field

* File 2:: The word Obama appears zero times in "slug" field and
thrice in "story" field (same score as File 4 score below)

* File 4:: The word Obama appears One time in "slug" field and
one time in "story" field (same score as File 2 score above)

* File 3:: The word Obama appears zero times in "slug" field and
two times in "story" field

 

 

My question: why is dismax showing "File 4:: The word Obama appears One
time in "slug" field and one time in "story" field" 

ahead of 

"File 2:: The word Obama appears zero times in "slug" field and thrice
in "story" field" given that I have boosted these fields as shown below.


 



story^3.0 slug^0.2



 

I would have thought that the "File 4:: The word Obama appears One time
in "slug" field and one time in "story" field" would have gone all the
way down in the result list.
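
(The handler definition quoted above lost its XML markup in transit. Judging from the
surviving values and the stock example solrconfig.xml, it was presumably along these
lines; this is a reconstruction, not the exact file:)

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">story^3.0 slug^0.2</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>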

 

Any help is appreciated

Thanks much in advance

Raj

 

 

 

 

 

 

 

 



Re: Question on Dismax plugin

2011-04-11 Thread Otis Gospodnetic
Hi Raj,

I'm guessing your slug field is much shorter and thus a match in that field has 
more weight than a match in a much longer story field.  If you omit norms for 
those fields in the schema (and reindex), I believe you will see File 4 drop to 
position #4.
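
(A minimal schema.xml sketch of that suggestion; the field types shown are assumptions,
since Raj's actual schema isn't included here:)

<!-- omitNorms="true" removes the length normalization that favors the short slug field -->
<field name="slug"  type="text" indexed="true" stored="true" omitNorms="true"/>
<field name="story" type="text" indexed="true" stored="true" omitNorms="true"/>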

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: "Nemani, Raj" 
> To: solr-user@lucene.apache.org
> Sent: Mon, April 11, 2011 4:12:52 PM
> Subject: Question on Dismax plugin
> 
> All,
> 
> I have a question on the Dismax plugin for the search handler.   I have
> two test instances of Solr.  In one I am using the default  search
> handler.  In this case, the fields that I am working with (slug  and
> story) are indexed via the all_text filed and the searches are done  on
> the all_text field.
> 
> For the other one I have configured a search  handler using the dismax
> plugin as shown below.
> 
> 
> 
> 
> 
> 
> 
>   dismax
> 
>  explicit
> 
>  0.01
> 
>  
> 
> story^3.0  slug^0.2
> 
>  
> 
>  100
> 
>  *:*
> 
>  
> 
>
> 
> 
> 
> To make testing easier, I only have 4  (same) documents in both indexes
> with the word "Obama" appearing inside as  described below.
> 
> 
> 
> File 1:: The word Obama appears zero times in  "slug" field and four
> times in "story" field
> 
> File 2:: The word Obama  appears zero times in "slug" field and thrice in
> "story" field
> 
> File  3:: The word Obama appears zero times in "slug" field and two times
> in  "story" field
> 
> File 4:: The word Obama appears One time in "slug" field  and one time in
> "story" field
> 
> 
> 
> 
> 
> Here is the order of  the documents in the order of decreasing scores
> from the search  results
> 
> 
> 
> Dismax Search Handler (steadily decreasing  scores):
> 
> * File 1:: The word Obama appears  zero times in "slug" field and
> four times in "story" field
> 
> *  File 4:: The word Obama appears One time in "slug" field  and
> one time in "story" field
> 
> * File 2::  The word Obama appears zero times in "slug" field and
> thrice in "story"  field
> 
> * File 3:: The word Obama appears zero  times in "slug" field and
> two times in "story" field
> 
> 
> 
> Standard  Search handler:
> 
> * File 1:: The word Obama  appears zero times in "slug" field and
> four times in "story"  field
> 
> * File 2:: The word Obama appears zero  times in "slug" field and
> thrice in "story" field (same score as File 4 score  below)
> 
> * File 4:: The word Obama appears One  time in "slug" field and
> one time in "story" field (same score as File 2  score above)
> 
> * File 3:: The word Obama  appears zero times in "slug" field and
> two times in "story" field
> 
> 
> 
> 
> 
> My question, why is dismax showing "File 4:: The word Obama  appears One
> time in "slug" field and one time in "story" field" 
> 
> ahead  of 
> 
> "File 2:: The word Obama appears zero times in "slug" field and  thrice
> in "story" field" given that I have boosted these fields as shown  below.
> 
> 
> 
> 
> 
> 
>  story^3.0 slug^0.2
> 
> 
> 
> 
> 
> I  would have thought that the ""File 4:: The word Obama appears One time
> in  "slug" field and one time in "story" field" would have gone all the
> way done  in the result list.
> 
> 
> 
> Any help is appreciated
> 
> Thanks much  in advance
> 
> Raj
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 


Re: Mongo REST interface and full data import

2011-04-11 Thread andrew_s
Thank you guys for your answers.
I didn't realise it would be so easy to do, and the example from
http://wiki.apache.org/solr/UpdateJSON#Example works perfectly for me.

Regards,
Andrew

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2808507.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MoreLikeThis match

2011-04-11 Thread Brian Lamb
Does anyone have any thoughts on this one?

On Fri, Apr 8, 2011 at 9:26 AM, Brian Lamb wrote:

> I've looked at both wiki pages and none really clarify the difference
> between these two. If I copy and paste an existing index value for field and
> do an mlt search, it shows up under match but not results. What is the
> difference between these two?
>
>
> On Thu, Apr 7, 2011 at 2:24 PM, Brian Lamb 
> wrote:
>
>> Actually, what is the difference between "match" and "response"? It seems
>> that match always returns one result but I've thrown a few cases at it where
>> the score of the highest response is higher than the score of match. And
>> then there are cases where the match score dwarfs the highest response
>> score.
>>
>>
>> On Thu, Apr 7, 2011 at 1:30 PM, Brian Lamb > > wrote:
>>
>>> Hi all,
>>>
>>> I've been using MoreLikeThis for a while through select:
>>>
>>> http://localhost:8983/solr/select/?q=field:more like
>>> this&mlt=true&mlt.fl=field&rows=100&fl=*,score
>>>
>>> I was looking over the wiki page today and saw that you can also do this:
>>>
>>> http://localhost:8983/solr/mlt/?q=field:more like
>>> this&mlt=true&mlt.fl=field&rows=100
>>>
>>> which seems to run faster and do a better job overall. When the results
>>> are returned, they are formatted like this:
>>>
>>> 
>>>   
>>> 0
>>> 1
>>>   
>>>   
>>> 
>>>   3.0438285
>>>   5
>>> 
>>>   
>>>   >> maxScore="0.12775186">
>>> 
>>>   0.1125823
>>>   3
>>> 
>>> 
>>>   0.10231556
>>>   8
>>> 
>>>  ...
>>>   
>>> 
>>>
>>> It seems that it always returns just 1 response under match and response
>>> is set by the rows parameter. How can I get more than one result under
>>> match?
>>>
>>> What I'm trying to do here is whatever is set for field:, I would like to
>>> return the top 100 records that match that search based on more like this.
>>>
>>> Thanks,
>>>
>>> Brian Lamb
>>>
>>
>>
>


Too many open files exception related to solrj getServer too often?

2011-04-11 Thread cyang2010
Hi,

I get this solrj error in development environment.

org.apache.solr.client.solrj.SolrServerException: java.net.SocketException:
Too many open files

At the time there was no reindexing or any write to the index.   There were
only different queries generated using solrj to hit the solr server:

CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
server.setSoTimeout(1000); // socket read timeout
server.setConnectionTimeout(1000);
server.setDefaultMaxConnectionsPerHost(100);
server.setMaxTotalConnections(100);
...
QueryResponse rsp = server.query(solrQuery);

I did NOT share reference of solrj CommonsHttpSolrServer among requests.  
So every http request will obtain a solrj solr server instance and run a query
on it.  

The question is:

1. Should solrj client share one instance of CommonHttpSolrServer?   Why? 
Is every CommonHttpSolrServer matched to one solr/lucene reader?  But from
the source code, it just shows it related to one apache http client.

2. Is TooManyOpenFiles exeption related to my possible wrong usage of
CommonHttpSolrServer?

3. server.query(solrQuery) throws SolrServerException.  How can concurrent
solr queries trigger a Too many open files exception?


Look forward to your input.  Thanks,



cy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Too-many-open-files-exception-related-to-solrj-getServer-too-often-tp2808718p2808718.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Exact match on a field with stemming

2011-04-11 Thread Jean-Sebastien Vachon
I'm curious to know why Solr is not respecting the phrase.
If it considers "manager" as a phrase... shouldn't it return only documents 
containing that phrase?

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quoted means "use this as a phrase", not "use this as a literal". :) I 
think copying to unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem 
search :: http://search-lucene.com/



- Original Message 
> From: Pierre-Luc Thibeault 
> To: solr-user@lucene.apache.org
> Sent: Mon, April 11, 2011 2:55:04 PM
> Subject: Exact match on a field with stemming
> 
> Hi all,
> 
> Is there a way to perform an exact match query on a field that  has 
>stemming enable by using the standard /select handler?
> 
> I thought  that putting word inside double-quotes would enable this 
>behaviour but if I  query my field with a single word like “manager”
> I am receiving results  containing the word “management”
> 
> I know I can use a CopyField with  different types but that would 
>double the size of my index… Is there an  alternative?
> 
> Thanks
> 



FW: Exact match on a field with stemming

2011-04-11 Thread Jonathan Rochkind

> I'm curious to know why Solr is not respecting the phrase.
> If it consider "manager" as a phrase... shouldn't it return only document 
> containing that phrase?

A phrase means to solr (or rather to the lucene and dismax query parsers, which 
are what understand double-quoted phrases)  "these tokens in exactly this order"

So a phrase of one token "manager", is exactly the same as if you didn't use 
the double quotes. It's only one token, so "all the tokens in this phrase in 
exactly the order specified" is, well, just the same as one token without 
phrase quotes. 

If you've set up a stemmed field at indexing time, then "manager" and 
"management" are stemmed IN THE INDEX, probably to something like "manag".  
There is no longer any information in the index (at least in that field) on 
what the original literal was, it's been stemmed in the index.  So there's no 
way possible for it to only match certain un-stemmed versions -- at least using 
that field. And when you enter either 'manager' or 'management' at query time, 
it is analyzed and stemmed to match that stemmed something-like "manag" in the 
index either way. If it didn't analyze and stem at query time, then instead the 
query would just match NOTHING, because neither 'manager' nor 'management' are 
in the index at all, only the stemmed versions. 

So, yes, double quotes are interpreted as a phrase, and only documents 
containing that phrase are returned, you got it. 


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quoted means "use this as a phrase", not "use this as a literal". :) I 
think copying to unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem 
search :: http://search-lucene.com/



- Original Message 
> From: Pierre-Luc Thibeault 
> To: solr-user@lucene.apache.org
> Sent: Mon, April 11, 2011 2:55:04 PM
> Subject: Exact match on a field with stemming
>
> Hi all,
>
> Is there a way to perform an exact match query on a field that  has
>stemming enable by using the standard /select handler?
>
> I thought  that putting word inside double-quotes would enable this
>behaviour but if I  query my field with a single word like “manager”
> I am receiving results  containing the word “management”
>
> I know I can use a CopyField with  different types but that would
>double the size of my index… Is there an  alternative?
>
> Thanks
>



Re: when to change rows param?

2011-04-11 Thread Chris Hostetter

Paul: can you elaborate a little bit on what exactly your problem is?

 - what is the full component list you are using?
 - how are you changing the param value (ie: what does the code look like)
 - what isn't working the way you expect?

: I've been using my own QueryComponent (that extends the search one) 
: successfully to rewrite web-received parameters that are sent from the 
: (ExtJS-based) javascript client. This allows an amount of 
: query-rewriting, that's good. I tried to change the rows parameter there 
: (which is "limit" in the query, as per the underpinnings of ExtJS) but 
: it seems that this is not enough.
: 
: Which component should I subclass to change the rows parameter?

-Hoss


Re: Deduplication questions

2011-04-11 Thread Chris Hostetter

: Q1. Is is possible to pass *analyzed* content to the
: 
: public abstract class Signature {

No, analysis happens as the documents are being written to the lucene 
index, well after the UpdateProcessors have had a chance to interact with 
the values.

: Q2. Method calculate() is using concatenated fields from name,features,cat
: Is there any mechanism I could build  "field dependant signatures"?

At the moment the Signature API is fairly minimal, but it could definitely 
be improved by adding more methods (that have sensible defaults in the 
base class) that would give the impl more control over the resulting 
signature ... we just need people to propose good suggestions with example 
use cases.

: Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but
: would work)

I don't know whether what you describe is really intentional or not, but it 
should work.


-Hoss
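
(For readers who haven't seen it, the single-chain setup being discussed is essentially
the stock dedupe example from the Solr Deduplication wiki, shown here with the field list
from the question; the signature field is assumed to exist in schema.xml:)

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- "signature" must be declared as a field in schema.xml -->
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>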


Re: XML not coming through from nabble to Gmail

2011-04-11 Thread Michael Sokolov
I see the same problem (missing markup) in Thunderbird. Seems like 
Nabble might be the culprit?


-Mike

On 4/11/2011 8:13 AM, Erick Erickson wrote:

All:

Lately I've been seeing a lot of posts where people paste in parts of their
schema.xml or solrconfig.xml and the results are...er...disappointing. None
of the less-than or greater-than symbols show and the formatting is all over
the map.

Since some mails would come through with the XML formatted and some would be
wonky, at first I thought it was the sender, but then a pretty high
percentage came through this way. So I poked around and it seems to only be
the case that the XML is "wonkified" (tm) when it comes to Gmail from
nabble; the original post on nabble has the markup and displays fine.
Behavior is the same in Chrome and Firefox BTW.

Does anyone have any insight into this? Time to complain to the nabble
folks? Do others see this with non-Gmail relays?

Thanks,
Erick





Re: Solr 1.4.1 compatible with Lucene 3.0.1?

2011-04-11 Thread Otis Gospodnetic
Hi,

I only read the short story. :)
Note that you should post questions like this on solr-user@lucene list, which 
is 
where I'm replying now.

Since you are just starting with Solr, why not grab the recently released 3.1?  
That way you'll get the latest Lucene and the latest Solr.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: RichSimon 
> To: solr-...@lucene.apache.org
> Sent: Mon, April 11, 2011 10:36:46 AM
> Subject: Solr 1.4.1 compatible with Lucene 3.0.1?
> 
> 
> Short story: I am using Lucene 3.0.1, and I'm trying to run Solr 1.4.1.  I
> get an error starting the embedded Solr server that says it cannot find  the
> method FSDirectory.getDirectory. The release notes for Solr 1.4.1 says it  is
> compatible with Lucene 2.9.3, and I see that Lucene 3.0.1 does not have  the
> FSDirectory.getDirectory method any more. Downgrading Lucene to 2.9.x  is
> not an option for me. What version of Solr should I use for Lucene  3.0.1?
> (We're just starting with Solr, so changing that version is not hard.)  Or,
> do I have to upgrade both Solr and  Lucene?
> 
> Thanks,
> 
> -Rich
> 
> Here's the long story:
> I am using  Lucene 3.0.1, and I'm trying to run Solr 1.4.1. I have not used
> any other  version of Lucene. We have an existing project using Lucene 3.0.1,
> and we  want to start using Solr. When I try to initialize an embedded Solr
> server,  like so:
> 
> 
>  String solrHome =  PATH_TO_SOLR_HOME;
> File  home = new File(solrHome);
>  File solrXML = new File(home, "solr.xml");
>  
>  coreContainer = new CoreContainer();
>  coreContainer.load(solrHome, solrXML);
>
>  embeddedSolr = new EmbeddedSolrServer(coreContainer,  SOLR_CORE);
> 
> 
> 
> [04-08 11:48:39] ERROR CoreContainer [main]: java.lang.NoSuchMethodError:
> org.apache.lucene.store.FSDirectory.getDirectory(Ljava/lang/String;)Lorg/apache/lucene/store/FSDirectory;
>   at org.apache.solr.spelling.AbstractLuceneSpellChecker.initIndex(AbstractLuceneSpellChecker.java:186)
>   at org.apache.solr.spelling.AbstractLuceneSpellChecker.init(AbstractLuceneSpellChecker.java:101)
>   at org.apache.solr.spelling.IndexBasedSpellChecker.init(IndexBasedSpellChecker.java:56)
>   at org.apache.solr.handler.component.SpellCheckComponent.inform(SpellCheckComponent.java:274)
>   at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:508)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
> 
> 
> Looking at Google posts about this, it seemed that this can be caused by  a
> version mismatch between the Lucene version in use and the one Solr tries  to
> use. I noticed a Lucene version tag in the example solrconfig.xml that  I’m
> modifying:
> 
>   LUCENE_40
> 
> I tried changing it to LUCENE_301, changing it to LUCENE_30, and commenting it
> out, but I still get the same  error. Using
> LucenePackage.get().getImplementationVersion() shows this as the  Lucene
> version:
>   
> Lucene version: 3.0.1 912433 -  2010-02-21 23:51:22
> 
> I also printed my classpath and found the following  lucene  jars:
> lucene-analyzers-3.0.1.jar
> lucene-core-3.0.1.jar
> lucene-highlighter-3.0.1.jar
> lucene-memory-3.0.1.jar
> lucene-misc-2.9.3.jar
> lucene-queries-2.9.3.jar
> lucene-snowball-2.9.3.jar
> lucene-spellchecker-2.9.3.jar
> 
> The  FSDirectory class is in lucene-core. I decompiled the class file in the
> jar,  and did not see a getDirectory method. Also, I used a ClassLoader
> statement  to get an instance of the FSDirectory class my code is using, and
> printed out  the methods; no getDirectory method.
> 
> I gather from the Lucene Javadoc  that the getDirectory method is in
> FSDirectory for 2.4.0 and for 2.9.0, but  is gone in 3.0.1 (the version I'm
> using). 
> 
> Is Lucene 3.0.1 completely  incompatible with Solr 1.4.1? Is there some way
> to use the luceneMatchVersion  tag to tell Solr what version I want to use?
> 
> 
> --
> View this message  in context: 
>http://lucene.472066.n3.nabble.com/Solr-1-4-1-compatible-with-Lucene-3-0-1-tp2806828p2806828.html
>
> Sent  from the Solr - Dev mailing list archive at  Nabble.com.
> 
> -
> To  unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For  additional commands, e-mail: dev-h...@lucene.apache.org
> 
>


Re: partial optimize does not reduce the segment number to maxNumSegments

2011-04-11 Thread Chris Hostetter

: I have a core with 120+ segment files and I tried partial optimize specify
: maxNumSegments=10, after the optimize the segment files reduced to 64 files;

a) the option you want to specify is "maxSegments" .. not "maxNumSegments"

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22optimize.22

b) i can't reproduce this ... i just created an index with 200 segments 
and when i hit the example url from the wiki...

curl 
'http://localhost:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false'

...my index was correctly optimized down to 10 segments.

is it possible that you just didn't wait long enough and you were 
observing the number of segments while the optimize was still taking 
place?


-Hoss


Re: does overwrite=false work with json

2011-04-11 Thread Chris Hostetter
: I tried it with the example json documents, and even if I add 
: overwrite=false to the URL, it still overwrites.
: 
: Do this twice:
: curl 'http://localhost:8983/solr/update/json?commit=true&overwrite=false' 
--data-binary @books.json -H 'Content-type:application/json'

...the JSON Update Request Handler doesn't have any docs suggesting that 
you can specify "overwrite" in the URL.

the CSVUpdateRequestHandler supports a lot of options in the URL because 
it doesn't have any way of specifying them as part of the content stream, 
but both the XML and JSON formats look for these options as part of the 
individual "add" commands.  There is a specific example of using 
"overwrite" on the JSON Wiki page...

http://wiki.apache.org/solr/UpdateJSON?highlight=%28overwrite%29#Update_Commands

...that said, it would certainly be nice if the default value for all of 
the options that can be specified as part of the data payloads for 
XML/JSON ("overwrite", "commitWithin", "boost", etc...) could be specified 
in the URL ... feel free to open a Jira requesting this.


-Hoss


Re: XML not coming through from nabble to Gmail

2011-04-11 Thread Chris Hostetter

: I see the same problem (missing markup) in Thunderbird. Seems like Nabble
: might be the culprit?

if someone can cite some specific examples (by email message-id, or 
subject, or date+sender, or url from nabble, or url from any public 
archive, or anything more specific than "posts from nabble containing 
xml") we can check the official apache mail archive which contains the 
"raw message" as received by ezmlm, such as...

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201104.mbox/raw/%3cbanlktimcpthzalstrwhn3rtzpxdzkbo...@mail.gmail.com%3E



-Hoss


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Joey Hanzel
Awesome. Thanks Jayendra.  I hadn't caught these patches yet.

I applied SOLR-2416 patch to the solr-3.1 release tag. This resolved the
problem of archive files not being unpacked and indexed with Solr CELL.
Thanks for the FYI.
https://issues.apache.org/jira/browse/SOLR-2416

On Mon, Apr 11, 2011 at 12:02 AM, Jayendra Patil <
jayendra.patil@gmail.com> wrote:

> The migration of Tika to the latest 0.8 version seems to have
> reintroduced the issue.
>
> I was able to get this working again with the following patches. (Solr
> Cell and Data Import handler)
>
> https://issues.apache.org/jira/browse/SOLR-2416
> https://issues.apache.org/jira/browse/SOLR-2332
>
> You can try these.
>
> Regards,
> Jayendra
>
> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel 
> wrote:
> > Hi Gary,
> >
> > I have been experiencing the same problem... Unable to extract content
> from
> > archive file formats.  I just tried again with a clean install of Solr
> 3.1.0
> > (using Tika 0.8) and continue to experience the same results.  Did you
> have
> > any success with this problem with Solr 1.4.1 or 3.1.0 ?
> >
> > I'm using this curl command to send data to Solr.
> > curl "
> >
> http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true
> "
> > -H "application/octet-stream" -F  "myfile=@data.zip"
> >
> > No problem extracting single rich text documents, but archive files only
> > result in the file names within the archive being indexed. Am I missing
> > something else in my configuration? Solr doesn't seem to be unpacking the
> > archive files. Based on the email chain associated with your first
> message,
> > some people have been able to get this functionality to work as desired.
> >
> > On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor  wrote:
> >
> >> Can anyone shed any light on this, and whether it could be a config
> issue?
> >>  I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
> >>
> >> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt)
> to
> >> the ExtractingRequestHandler, I get the following log entry (formatted
> for
> >> ease of reading) :
> >>
> >> SolrInputDocument[
> >>{
> >>ignored_meta=ignored_meta(1.0)={
> >>[stream_source_info, file, stream_content_type,
> >> application/octet-stream, stream_size, 260, stream_name, solr1.zip,
> >> Content-Type, application/zip]
> >>},
> >>ignored_=ignored_(1.0)={
> >>[package-entry, package-entry]
> >>},
> >>ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>
> >>
>  
> ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
> >>
> >>ignored_stream_size=ignored_stream_size(1.0)={260},
> >>ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
> >>ignored_content_type=ignored_content_type(1.0)={application/zip},
> >>docid=docid(1.0)={74},
> >>type=type(1.0)={5},
> >>text=text(1.0)={  doc2.txtdoc1.txt}
> >>}
> >> ]
> >>
> >> So, the data coming back from Tika when parsing a ZIP file does not
> include
> >> the file contents, only the names of the files contained therein.  I've
> >> tried forcing stream.type=application/zip in the CURL string, but that
> makes
> >> no difference.  If I specify an invalid stream.type then I get an
> exception
> >> response, so I know it's being used.
> >>
> >> When I send one of those txt files individually to the
> >> ExtractingRequestHandler, I get:
> >>
> >> SolrInputDocument[
> >>{
> >>ignored_meta=ignored_meta(1.0)={
> >>[stream_source_info, file, stream_content_type, text/plain,
> >> stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
> >>},
> >>ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>
> >>
>  ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
> >>ignored_stream_size=ignored_stream_size(1.0)={30},
> >>ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
> >>ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
> >>docid=docid(1.0)={74},
> >>type=type(1.0)={5},
> >>text=text(1.0)={The quick brown fox  }
> >>}
> >> ]
> >>
> >> and we see the file contents in the "text" field.
> >>
> >> I'm using the following requestHandler definition in solrconfig.xml:
> >>
> >> 
> >> text
> >> true
> >> ignored_
> >>
> >> 
> >> true
> >> links
> >> ignored_
> >> 
> >> 
> >>
> >> Is there any further debug or diagnostic I can get out of Tika to help
> me
> >> work out why it's only returning the file names and not the file
> contents
> >> when parsing a ZIP file?
> >>
> >>
> >> Thanks and kind regards,
> >> Gary.
> >>
> >>
> >>
> >> On 25/01/2011 16:48, Jayendra Patil wrote:
> >>
> >>> Hi Gary,
> >>>
> >>> The latest Solr Trunk was able to extract and index the contents of the
> >>> zip
> >>> file using the ExtractingRequestHandler.
> >>> The snapshot of Trunk we worked u

RE: Exact match on a field with stemming

2011-04-11 Thread Jean-Sebastien Vachon
Thanks for the clarification. This makes sense.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: April-11-11 7:54 PM
To: solr-user@lucene.apache.org
Subject: FW: Exact match on a field with stemming


> I'm curious to know why Solr is not respecting the phrase.
> If it consider "manager" as a phrase... shouldn't it return only document
containing that phrase?

A phrase means to solr (or rather to the lucene and dismax query parsers,
which are what understand double-quoted phrases)  "these tokens in exactly
this order"

So a phrase of one token "manager", is exactly the same as if you didn't use
the double quotes. It's only one token, so "all the tokens in this phrase in
exactly the order specified" is, well, just the same as one token without
phrase quotes. 

If you've set up a stemmed field at indexing time, then "manager" and
"management" are stemmed IN THE INDEX, probably to something like "manag".
There is no longer any information in the index (at least in that field) on
what the original literal was, it's been stemmed in the index.  So there's
no way possible for it to only match certain un-stemmed versions -- at least
using that field. And when you enter either 'manager' or 'management' at
query time, it is analyzed and stemmed to match that stemmed something-like
"manag" in the index either way. If it didn't analyze and stem at query
time, then instead the query would just match NOTHING, because neither
'manager' nor 'management' are in the index at all, only the stemmed
versions. 

So, yes, double quotes are interpreted as a phrase, and only documents
containing that phrase are returned, you got it. 


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: April-11-11 3:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Exact match on a field with stemming

Hi,

Using quoted means "use this as a phrase", not "use this as a literal". :) I
think copying to unstemmed field is the only/common work-around.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem
search :: http://search-lucene.com/



- Original Message 
> From: Pierre-Luc Thibeault 
> To: solr-user@lucene.apache.org
> Sent: Mon, April 11, 2011 2:55:04 PM
> Subject: Exact match on a field with stemming
>
> Hi all,
>
> Is there a way to perform an exact match query on a field that  has 
>stemming enable by using the standard /select handler?
>
> I thought  that putting word inside double-quotes would enable this 
>behaviour but if I  query my field with a single word like "manager"
> I am receiving results  containing the word "management"
>
> I know I can use a CopyField with  different types but that would 
>double the size of my index. Is there an  alternative?
>
> Thanks
>

=



Re: MoreLikeThis match

2011-04-11 Thread Mike Mattozzi
Match is the document that's the top result of the query (q param)
that you specify.

Response is the list of documents that are similar to the 'match' document.

-Mike

On Mon, Apr 11, 2011 at 4:55 PM, Brian Lamb
 wrote:
> Does anyone have any thoughts on this one?
>
> On Fri, Apr 8, 2011 at 9:26 AM, Brian Lamb 
> wrote:
>
>> I've looked at both wiki pages and none really clarify the difference
>> between these two. If I copy and paste an existing index value for field and
>> do an mlt search, it shows up under match but not results. What is the
>> difference between these two?
>>
>>
>> On Thu, Apr 7, 2011 at 2:24 PM, Brian Lamb 
>> wrote:
>>
>>> Actually, what is the difference between "match" and "response"? It seems
>>> that match always returns one result but I've thrown a few cases at it where
>>> the score of the highest response is higher than the score of match. And
>>> then there are cases where the match score dwarfs the highest response
>>> score.
>>>
>>>
>>> On Thu, Apr 7, 2011 at 1:30 PM, Brian Lamb >> > wrote:
>>>
 Hi all,

 I've been using MoreLikeThis for a while through select:

 http://localhost:8983/solr/select/?q=field:more like
 this&mlt=true&mlt.fl=field&rows=100&fl=*,score

 I was looking over the wiki page today and saw that you can also do this:

 http://localhost:8983/solr/mlt/?q=field:more like
 this&mlt=true&mlt.fl=field&rows=100

 which seems to run faster and do a better job overall. When the results
 are returned, they are formatted like this:

 
   
     0
     1
   
   
     
       3.0438285
       5
     
   
   >>> maxScore="0.12775186">
     
       0.1125823
       3
     
     
       0.10231556
       8
     
  ...
   
 

 It seems that it always returns just 1 response under match and response
 is set by the rows parameter. How can I get more than one result under
 match?

 What I'm trying to do here is whatever is set for field:, I would like to
 return the top 100 records that match that search based on more like this.

 Thanks,

 Brian Lamb

>>>
>>>
>>
>


Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1

2011-04-11 Thread Lance Norskog
The DIH has multi-threading. You can have one thread fetching files
and then give them to different threads.

On Mon, Apr 11, 2011 at 11:40 AM,   wrote:
> Hi Lance,
>
> I used XPathEntityProcessor with attribut "xsl" and generate a xml-File "in 
> the form of the standard Solr update schema".
> I lost a lot of performance, it is a pity that XPathEntityProcessor does only 
> use one thread.
>
> My tests with a collection of 350T Document:
> 1. use of XPathRecordReader without xslt:  28min
> 2. use of XPathEntityProcessor with xslt (Standard solr-war / Xalan): 44min
> 2. use of XPathEntityProcessor with saxon-xslt: 36min
>
>
> Best regards
>  Karsten
>
>
>
>  Lance
>> There is an option somewhere to use the full XML DOM implementation
>> for using xpaths. The purpose of the XPathEP is to be as simple and
>> dumb as possible and handle most cases: RSS feeds and other open
>> standards.
>>
>> Search for xsl(optional)
>>
>> http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1
>>
> --karsten
>> > Hi Folks,
>> >
>> > does anyone improve DIH XPathRecordReader to deal with nested xpaths?
>> > e.g.
>> > data-config.xml with
>> >   <field column="alltext" xpath="//body" flatten="true" />
>> >   <field column="title" xpath="//body/h1" />
>> > and the XML stream contains
>> >  /html/body/h1...
>> > will only fill field “alltext” but field “title” will be empty.
>> >
>> > This is a known issue from 2009
>> >
>> https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose
>> >
>> > So three questions:
>> > 1. How to fill a “search over all”-Field without nested xpaths?
>> >   (schema.xml <copyField/> will not help,
>> because we lose the original token order)
>> > 2. Does anyone try to improve XPathRecordReader to deal with nested
>> xpaths?
>> > 3. Does anyone else need this feature?
>> >
>> >
>> > Best regards
>> >  Karsten
>> >
>
> http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr under Tomcat

2011-04-11 Thread Lance Norskog
Hi Mike-

Please start a new thread for this.

On Mon, Apr 11, 2011 at 2:47 AM, Mike  wrote:
> Hi All,
>
> I have installed solr instance on tomcat6. When i tried to index the PDF
> file i was able to see the response:
>
>
> 0
> 479
>
>
> Query:
> http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true
>
> But when i tried to search the content in the pdf i could not get any
> results:
>
>
>
> 0
> 2
> −
>
> on
> 0
> struts
> 10
> 2.2
>
>
>
>
>
> Could you please let me know if I am doing anything wrong. It works fine
> when i tried with default jetty server prior to integrating on the tomcat6.
>
> I have followed installation steps from
> http://wiki.apache.org/solr/SolrTomcat
> (Tomcat on Windows Single Solr app).
>
> Thanks,
> Mike
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-under-Tomcat-tp2613501p2805970.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing Best Practice

2011-04-11 Thread Lance Norskog
SOLR-1499 is a plug-in for the DIH that uses Solr as a DataSource.
This means that you can read the database and PDFs separately. You
could index all of the PDF content in one DIH script. Then, when
there's a database update, you have a separate DIH script that reads
the old row from Solr, and pulls the stripped text from the PDF, and
then re-indexes the whole thing. This would cut out the need to
reparse the PDF.

Lance

On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell
 wrote:
> If it's of any help I've split the processing of PDF files from the
> indexing. I put the PDF content into a text file (but I guess you could load
> it into a database) and use that as part of the indexing.  My processing of
> the PDF files also compares timestamps on the document and the text file so
> that I'm only processing documents that have changed.
>
> I am a newbie so perhaps there are more sophisticated approaches.
>
> Hope that helps.
> Shaun
>
> On 11 April 2011 07:20, Darx Oman  wrote:
>
>> Hi guys
>>
>> I'm wondering how to best configure solr to fulfill my requirements.
>>
>> I'm indexing data from 2 data sources:
>> 1- Database
>> 2- PDF files (password encrypted)
>>
>> Every file has related information stored in the database.  Both the file
>> content and the related database fields must be indexed as one document in
>> solr.  Among the DB data is *per-user* permissions for every document.
>>
>> The file contents nearly never change, on the other hand, the DB data and
>> especially the permissions change very frequently, which requires me to
>> re-index everything for every modified document.
>>
>> My problem is the process of decrypting the PDF files before re-indexing
>> them,
>> which takes too much time for a large number of documents; it could take
>> days for a full re-indexing.
>>
>> What I'm trying to accomplish is eliminating the need to re-index the PDF
>> content if not changed even if the DB data changed.  I know this is not
>> possible in solr, because solr doesn't update documents.
>>
>> So how to best accomplish this:
>>
>> Can I use 2 indexes one for PDF contents and the other for DB data and have
>> a common id field for both as a link between them, *and results are treated
>> as one Document*?
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Tika, Solr running under Tomcat 6 on Debian

2011-04-11 Thread Lance Norskog
Ah! Did you set the UTF-8 parameter in Tomcat?
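
(The setting Lance refers to is presumably the connector's URIEncoding in Tomcat's
conf/server.xml, as described on the SolrTomcat wiki. A minimal sketch, with the port
matching the URLs in Mike's message and the other attributes being Tomcat defaults:)

<!-- conf/server.xml: URIEncoding makes Tomcat decode query parameters as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>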

On Mon, Apr 11, 2011 at 2:49 AM, Mike  wrote:
> Hi Roy,
>
> Thank you for the quick reply. When i tried to index the PDF file i was able
> to see the response:
>
>
> 0
> 479
>
>
>
> Query:
> http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true
>
> But when i tried to search the content in the pdf i could not get any
> results:
>
>
>
> 0
> 2
> −
>
> on
> 0
> struts
> 10
> 2.2
>
>
>
>
>
> Could you please let me know if I am doing anything wrong. It works fine
> when i tried with default jetty server prior to integrating on the tomcat6.
>
> I have followed installation steps from
> http://wiki.apache.org/solr/SolrTomcat
> (Tomcat on Windows Single Solr app).
>
> Thanks,
> Mike
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805974.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr 3.1 performance compared to 1.4.1

2011-04-11 Thread Lance Norskog
Marius: "I have copied the configuration from 1.4.1 to the 3.1."

Does the Directory implementation show up in the JMX beans? In
admin/statistics.jsp ? Or the Solr startup logs? (Sorry, don't have a
Solr available.)

Yonik:
> What platform are you on?  I believe the Lucene Directory
> implementation now tries to be smarter (compared to lucene 2.9) about
> picking the best default (but it may not be working out for you for
> some reason)

Lance

On Sun, Apr 10, 2011 at 12:46 PM, Yonik Seeley
 wrote:
> On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt
>  wrote:
>> Hello !
>>
>> I'm new to the list, have been using SOLR for roughly 6 months and love it.
>>
>> Currently i'm setting up a 3.1 installation, next to a 1.4.1 installation
>> (Ubuntu server, same JVM params). I have copied the configuration from 1.4.1
>> to the 3.1.
>> Both version are running fine, but one thing ive noticed, is that the QTime
>> on 3.1, is much slower for initial searches than on the (currently
>> production) 1.4.1 installation.
>>
>> For example:
>>
>> Searching with 3.1; http://mysite:9983/solr/select?q=grasmaaier: QTime
>> returns 371
>> Searching with 1.4.1: http://mysite:8983/solr/select?q=grasmaaier: QTime
>> returns 59
>>
>> Using debugQuery=true, i can see that the main time is spend in the query
>> component itself (org.apache.solr.handler.component.QueryComponent).
>>
>> Can someone explain this, and how can i analyze this further ? Does it take
>> time to build up a decent query, so could i switch to 3.1 without having to
>> worry ?
>
> Thanks for the report... there's no reason that anything should really
> be much slower, so it would be great to get to the bottom of this!
>
> Is this using the same index as the 1.4.1 server, or did you rebuild it?
>
> Are there any other query parameters (that are perhaps added by
> default, like faceting or anything else that could take up time) or is
> this truly just a term query?
>
> What platform are you on?  I believe the Lucene Directory
> implementation now tries to be smarter (compared to lucene 2.9) about
> picking the best default (but it may not be working out for you for
> some reason).
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>



-- 
Lance Norskog
goks...@gmail.com


Indexing Flickr and Panaramio

2011-04-11 Thread Estrada Groups
Has anyone tried doing this? Got any tips for someone getting started?

Thanks,
Adam

Sent from my iPhone


Re: Clarifying "fetchindex" command

2011-04-11 Thread Mark Miller
Looking at the code, issuing a fetchindex will cause the fetch to occur right 
away, with no respect for polling.

- Mark

On Apr 11, 2011, at 12:37 PM, Otis Gospodnetic wrote:

> Hi,
> 
> Can one actually *force* replication of the index from the master without a 
> commit being issued on the master since the last replication?
> 
> I do see "Force a fetchindex on slave from master command: 
> http://slave_host:port/solr/replication?command=fetchindex" on 
> http://wiki.apache.org/solr/SolrReplication#HTTP_API, but that feels more 
> like 
> "force the replication *now* instead of waiting for the slave to poll the 
> master" than "force the replication even if there is no new commit point and 
> no 
> new index version on the master".  Which one is it, really?
> 
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org







Re: when to change rows param?

2011-04-11 Thread Paul Libbrecht
Hoss,

as of now I managed to adjust this in the client code before it touches the 
server so it is not urgent at all anymore.

I wanted to avoid touching the client code (which is giving, oh great fun, MSIE 
concurrency miseries) hence I wanted a server-side rewrite of the maximum 
number of hits returned. Thus far my server customizations, apart from a custom 
solrconfig and schema, are a query component and a response handler. 

I thought that injecting the rows param in the query-component would have been 
enough (from the "limits" param my client is giving). But it seems not to be 
the case.

paul
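
(For the archive: if the goal is only to force the row count on the server side,
solrconfig.xml can pin the parameter without any custom component. A minimal sketch;
the handler name and the value 50 are illustrative, not from Paul's setup:)

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- clients cannot override an invariant, so rows is always 50 here -->
    <int name="rows">50</int>
  </lst>
</requestHandler>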


On 12 Apr 2011, at 02:07, Chris Hostetter wrote:

> 
> Paul: can you elaborate a little bit on what exactly your problem is?
> 
> - what is the full component list you are using?
> - how are you changing the param value (ie: what does the code look like)
> - what isn't working the way you expect?
> 
> : I've been using my own QueryComponent (that extends the search one) 
> : successfully to rewrite web-received parameters that are sent from the 
> : (ExtJS-based) javascript client. This allows an amount of 
> : query-rewriting, that's good. I tried to change the rows parameter there 
> : (which is "limit" in the query, as per the underpinnings of ExtJS) but 
> : it seems that this is not enough.
> : 
> : Which component should I subclass to change the rows parameter?
> 
> -Hoss



Re: Can I set up a config-based distributed search

2011-04-11 Thread Ran Peled
Thanks, Ludovic and Jonathan.  Yes, this configuration default is exactly
what I was looking for.

Ran
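
(A minimal sketch of the configuration default being referred to; the handler name and
shard hosts below are placeholders:)

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- used whenever the request does not supply a shards parameter -->
    <str name="shards">solr1:8983/solr,solr2:8983/solr</str>
  </lst>
</requestHandler>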


On Mon, Apr 11, 2011 at 7:12 PM, Jonathan Rochkind  wrote:

> I have not worked with shards/distributed, but I think you can probably
> specify them as defaults in your requesthandler in your solrconfig.xml
> instead.
>
> Somewhere there is (or was) a wiki page on this I can't find right now.
> There's a way to specify (for a particular request handler) a default
> parameter value, such as for 'shards', that will be used if none were given
> with the request. There's also a way to specify an invariant that will
> always be used even if something else is passed in on the request.
>
> Ah, found it: http://wiki.apache.org/solr/SearchHandler#Configuration
>
>
> On 4/11/2011 8:31 AM, Ran Peled wrote:
>
>> In the Distributed Search page (
>> http://wiki.apache.org/solr/DistributedSearch), it is documented that in
>> order to perform a distributed search over a sharded index, I should use
>> the
>> "shards" request parameter, listing the shards to participate in the
>> search
>> (e.g. ?shards=localhost:8983/solr,localhost:7574/solr).   I am planning a
>> new pretty large index (1B+ items).  Say I have a 100 shards, specifying
>> the
>> shards on the request URL becomes unrealistic due to length of URL.  It is
>> also redundant to do that on every request.
>>
>> Is there a way to specify the list of shards in a configuration file,
>> instead of on the query URL?  I have seen references to relevant config in
>> SolrCloud, but as I understand it planned to be released only in Solr 4.0.
>>
>> Thanks,
>> Ran
>>
>>