solr / lucene engineering positions in Boston, MA USA @ the Echo Nest

2010-09-10 Thread Brian Whitman
Hi all, a brief message to let you know that we're in heavy hiring mode at the
Echo Nest. As many of you know we are very heavy solr/lucene users (~1bn
documents across many, many servers) and a lot of our staff have been working
with and contributing to the projects over the years. We are a "music
intelligence" company -- we crawl the web and do a lot of fancy math on
music audio and text to then provide things like recommendation, feeds,
remix capabilities, playlisting, etc. to a lot of music labels, social
networks and small developers via a very popular API.

We are especially interested in people with Lucene & Solr experience who
aren't afraid to get into the guts and push it to its limits. If any of
these positions fit you please let me know. We are hiring full time in the
Boston area (Davis Square, Somerville) for senior and junior engineers as
well as data architects.

http://the.echonest.com/company/jobs/

http://developer.echonest.com/docs/v4/

http://the.echonest.com/


autocommit commented out -- what is the default?

2010-12-04 Thread Brian Whitman
Hi, if you comment out the <autoCommit> block in solrconfig.xml:

Does this mean that (a) commits never happen automatically or (b) some
default autocommit is applied?
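
For reference, a sketch of the block being asked about, roughly as it appears
(commented out) in the stock example solrconfig.xml -- the exact limits are
whatever your config had:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many pending docs -->
    <maxTime>1000</maxTime>  <!-- or after this many ms, whichever comes first -->
  </autoCommit>
</updateHandler>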


"document commit" possible?

2008-06-23 Thread Brian Whitman
Could the commit operation be adapted so the searchers become aware of
new stored content in a particular document?

e.g.

With the understanding that queries for newly indexed fields in this
document would not return the newly added document, but a query for
the document by its id would return any new stored fields. When the
"real" commit (read: the commit that takes 10 minutes to complete)
returns, the newly indexed fields would be query-able.






Re: diversity in results

2008-08-04 Thread Brian Whitman

On Aug 4, 2008, at 12:50 PM, Jason Rennie wrote:

Is there any option in solr to encourage diversity in the results?
Our solr index has millions of products, many of which are quite
similar to each other.  Even something simple like max 50% text
overlap in successive results would be valuable.  Does something
like this exist in solr or are there any plans to add it?



not out of the box, but I would use the mlt handler on the first  
result and remove all the ones that appear in both the MLT and query  
response.
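
A sketch of that approach (the /mlt handler registration, field names and host
here are assumptions, not anything from this thread):

http://host:8983/solr/select?q=foo&fl=id&rows=50
http://host:8983/solr/mlt?q=id:TOP_HIT_ID&mlt.fl=text&mlt.mintf=1&mlt.mindf=1&fl=id&rows=50

Then drop from the select response any id that also appears in the mlt
response before rendering.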


B



partialResults, distributed search & SOLR-502

2008-08-15 Thread Brian Whitman

I was going to file a ticket like this:

"A SOLR-303 query with &shards=host1,host2,host3 when host3 is down  
returns an error. One of the advantages of a shard implementation is  
that data can be stored redundantly across different shards, either as  
direct copies (e.g. when host1 and host3 are snapshooter'd copies of  
each other) or where there is some "data RAID" that stripes indexes  
for redundancy."


But then I saw SOLR-502, which appears to be committed.

If I have the above scenario (host1,host2,host3 where host3 is not up)  
and set a timeAllowed, will I still get a 400 or will it come back  
with "partial" results? If not, can we think of a way to get this to  
work? It's my understanding that duplicate docIDs are already merged
in the SOLR-303 response, so other than building in some "this host
isn't working, just move on and report it" logic, and of course the work
to index redundantly, we wouldn't need anything else to achieve a good
redundant shard implementation.
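
For reference, a sketch of the request in question (hosts are placeholders;
timeAllowed is the SOLR-502 parameter, in milliseconds):

http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr&timeAllowed=2000

When the time limit trips, SOLR-502 flags partialResults in the
responseHeader rather than failing the request.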


B




Re: partialResults, distributed search & SOLR-502

2008-08-18 Thread Brian Whitman

On Aug 18, 2008, at 11:51 AM, Ian Connor wrote:
On Mon, Aug 18, 2008 at 9:31 AM, Ian Connor <[EMAIL PROTECTED]>  
wrote:

I don't think this patch is working yet. If I take a shard out of
rotation (even just one out of four), I get an error:

org.apache.solr.client.solrj.SolrServerException:
java.net.ConnectException: Connection refused




It's my understanding that SOLR-502 is really only concerned with
queries timing out (i.e. they connect but take over N seconds to
return). If the connection is refused, a non-solr java connection
exception is thrown instead. Something would have to be added that
(optionally) catches connection errors and still builds the response
from the shards that did respond.








--
http://variogr.am/





Re: partialResults, distributed search & SOLR-502

2008-08-18 Thread Brian Whitman

On Aug 18, 2008, at 12:31 PM, Yonik Seeley wrote:


On Mon, Aug 18, 2008 at 12:16 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
Yes, as far as I know, what Brian said is correct.  Also, as far as  
I know, there is nothing that gracefully handles problematic Solr  
instances during distributed search.


Right... we punted that issue to a load balancer (which assumes that
you have more than one copy of each shard).



Can you explain how you have a LB handling shards? Do you put a  
separate LB in front of each group of replica shards?
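
A sketch of what that would look like from the query side (the VIP hostnames
are hypothetical): each shards entry names a load-balancer address that
fronts all the copies of one shard, e.g.

shards=shard1-vip:8983/solr,shard2-vip:8983/solr,shard3-vip:8983/solr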




Re: which shard is a result coming from

2008-08-19 Thread Brian Whitman


On Aug 19, 2008, at 8:49 AM, Ian Connor wrote:

What is the current "special requestHandler" that you can set  
currently?


If you're referring to my issue post, that's just something we have
internally (not in trunk solr) that we use instead of /update -- it
just inserts a field containing hostname:port/solr into the
incoming XML doc add stream. Not very clean, but it works. Use Lars's
patch.
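
A sketch of what the rewritten add stream carries (the field name is our
internal convention, not anything in trunk):

<add>
  <doc>
    <field name="id">somedoc</field>
    <field name="shard">host1:8983/solr</field>
  </doc>
</add>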






in a RequestHandler's init, how to get solr data dir?

2008-08-26 Thread Brian Whitman
I want to be able to store non-solr data in solr's data directory  
(like solr/solr/data/stored alongside solr/solr/data/index)
The java class that sets up this data is instantiated from a  
RequestHandlerBase class like:


public class StoreDataHandler extends RequestHandlerBase {
  StoredData storedData;

  @Override
  public void init(NamedList args)
  {
    super.init(args);
    String dataDirectory = ????;   // <-- how do I get solr's data dir here?
    storedData = new StoredData(dataDirectory);
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req,
      SolrQueryResponse rsp) throws Exception

...

req.getCore() etc. will eventually get me solr's data directory
location, but how do I get it in the init method? I want to init the
data store once on solr launch, not on every call.

What do I replace those ???? above with?





Re: in a RequestHandler's init, how to get solr data dir?

2008-08-26 Thread Brian Whitman


On Aug 26, 2008, at 12:24 PM, Shalin Shekhar Mangar wrote:


Hi Brian,

You can implement the SolrCoreAware interface, which will give you
access to the SolrCore object through the SolrCoreAware#inform method
you will need to implement. It is called after the init method.


Shalin, that worked. Thanks a ton!
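
A sketch of the resulting handler (the StoredData class and the "stored"
subdirectory are from my setup, not Solr):

import java.io.File;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.util.plugin.SolrCoreAware;

public class StoreDataHandler extends RequestHandlerBase implements SolrCoreAware {
  StoredData storedData;

  // inform() runs once, after init(), when the core is fully constructed
  public void inform(SolrCore core) {
    String dataDirectory = new File(core.getDataDir(), "stored").getPath();
    storedData = new StoredData(dataDirectory);
  }
  ...
}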




Re: Adding a field?

2008-08-26 Thread Brian Whitman


On Aug 26, 2008, at 3:09 PM, Jon Drukman wrote:

Is there a way to add a field to an existing index without stopping  
the server, deleting the index, and reloading every document from  
scratch?




You can add a field to the schema at any time without adversely
affecting the rest of the index. You have to restart the server, but
you don't have to re-index existing documents. Of course, only
documents indexed after the change will carry the new field, so
queries against it will only match those.

You can also define dynamic fields like x_* which would let you add
any field name you want without restarting the server.
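
For reference, a sketch of the two schema.xml declarations in question (field
names and types here are examples):

<!-- a concrete field added after the fact; needs a server restart -->
<field name="newfield" type="string" indexed="true" stored="true"/>

<!-- a dynamic field: any x_* name works with no schema change or restart -->
<dynamicField name="x_*" type="string" indexed="true" stored="true"/>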





UpdateRequestProcessorFactory / Chain etc

2008-09-06 Thread Brian Whitman
Trying to build a simple UpdateRequestProcessor that keeps a field  
(the time of original index) when overwriting a document.


1) Can I make an updateRequestProcessor chain only work for a certain
handler, or does putting the following in my solrconfig.xml:

<updateRequestProcessorChain name="KeepIndexed">
  <processor class="KeepIndexedDateFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

just handle all document updates?

2) Does an UpdateRequestProcessor support inform?






Re: UpdateRequestProcessorFactory / Chain etc

2008-09-06 Thread Brian Whitman

Answered my own qs, I think:


Trying to build a simple UpdateRequestProcessor that keeps a field  
(the time of original index) when overwriting a document.


1) Can I make an updateRequestProcessor chain only work for a certain
handler, or does putting the following in my solrconfig.xml:

<updateRequestProcessorChain name="KeepIndexed">
  <processor class="KeepIndexedDateFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

just handle all document updates?




What you have to do is:

<requestHandler name="/update2" class="solr.XmlUpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.processor">KeepIndexed</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="KeepIndexed">
  <processor class="KeepIndexedDateFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

And then calls to /update2 will go through the chain. Calls to /update
will not.





2) Does an UpdateRequestProcessor support inform?



No, not that I can tell. And the factory won't get instantiated until  
the first time you use it.








Re: UpdateRequestProcessorFactory / Chain etc

2008-09-07 Thread Brian Whitman
Hm... I seem to be having trouble getting either the Factory or the  
Processor to do an init() for me.


The end result I'd like to see is a function that gets called only  
once, either on solr init or the first time the handler is called.  I  
can't seem to do that.


I have these two classes:

public class KeepIndexedDateFactory extends  
UpdateRequestProcessorFactory

with a getInstance method

and then

class KeepIndexedDateProcessor extends UpdateRequestProcessor
with a processAdd method


The init() on both classes is never called, ever.

The getInstance() method of the first class is called every time I add  
a doc, so I can't init stuff there.


inform() of the first class is called if I make it implement
SolrCoreAware -- but the class I need to instantiate once is only
needed in the second class.


I hope this makes sense -- java is not my first language.






Re: UpdateRequestProcessorFactory / Chain etc

2008-09-07 Thread Brian Whitman


On Sep 7, 2008, at 2:04 PM, Brian Whitman wrote:

Hm... I seem to be having trouble getting either the Factory or the  
Processor to do an init() for me.


The end result I'd like to see is a function that gets called only  
once, either on solr init or the first time the handler is called.   
I can't seem to do that.




Here's my code, and a solution I think works -- is there a better way  
to do this:



public class KeepIndexedDateFactory extends
    UpdateRequestProcessorFactory implements SolrCoreAware
{
  DataClassIWantToInstantiateOnce data;

  public void inform(SolrCore core) {
    // called once, when the core is ready -- the expensive init lives here
    data = new DataClassIWantToInstantiateOnce(null);
  }

  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next)
  {
    // a new (cheap) processor per request; it borrows the shared data
    KeepIndexedDateProcessor p = new KeepIndexedDateProcessor(next);
    p.associateData(data);
    return p;
  }
}

class KeepIndexedDateProcessor extends UpdateRequestProcessor
{
  DataClassIWantToInstantiateOnce data;

  public KeepIndexedDateProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  public void associateData(DataClassIWantToInstantiateOnce d) { data = d; }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object id = doc.getFieldValue("id");
    if (id != null) {
      // look up the doc being overwritten and carry its "indexed" date forward
      SolrQuery getIndexedDatesOfId = new SolrQuery();
      getIndexedDatesOfId.setQuery("id:" + id.toString());
      getIndexedDatesOfId.setFields("indexed");
      getIndexedDatesOfId.setRows(1);
      QueryResponse qr = data.query(getIndexedDatesOfId);
      if (qr != null && qr.getResults() != null && qr.getResults().size() > 0) {
        Date thisIndexed = (Date) qr.getResults().get(0).getFieldValue("indexed");
        doc.setField("indexed", thisIndexed);
      }
    }
    // pass it up the chain
    super.processAdd(cmd);
  }
}



RequestHandler that passes along the query

2008-10-03 Thread Brian Whitman
Not sure if this is possible or easy: I want to make a requestHandler that
acts just like select but does stuff with the output before returning it to
the client.
e.g.
http://url/solr/myhandler?q=type:dog&sort=legs+desc&shards=dogserver1,dogserver2

When myhandler gets it, I'd like to take the results of that query as if I
sent it to select, then do stuff with the output before returning it. For
example, it would add a field to each returned document from an external
data store.

This is sort of like an UpdateRequestProcessor chain thing, but for the
select side. Is this possible?

Alternately, I could have my custom RequestHandler do the query. But all I
have in the RequestHandler is a SolrQueryRequest. Can I pass that along to
something and get a SolrDocumentList back?


Re: RequestHandler that passes along the query

2008-10-04 Thread Brian Whitman
Thanks grant and ryan, so far so good. But I am confused about one thing -
when I set this up like:

  public void process(ResponseBuilder rb) throws IOException {

And put it as the last-component on a distributed search (a defaults shard
is defined in the solrconfig for the handler), the component never does its
thing. I looked at the TermVectorComponent implementation and it instead
defines

public int distributedProcess(ResponseBuilder rb) throws IOException {

And when I implemented that method it works. Is there a way to define just
one method that will work with both distributed and normal searches?



On Fri, Oct 3, 2008 at 4:41 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> No need to even write a new ReqHandler if you're using 1.3:
> http://wiki.apache.org/solr/SearchComponent
>


Re: RequestHandler that passes along the query

2008-10-04 Thread Brian Whitman
Sorry for the extended question, but I am having trouble making a
SearchComponent that can actually get at the returned response in a
distributed setup.
In my distributedProcess:

public int distributedProcess(ResponseBuilder rb) throws IOException {

How can I get at the returned results from all shards? I really want to
get at the rendered response right before it goes back to the client so I
can add some information based on what came back.

The TermVector example seems to get at rb.resultIds (which is not public and
I can't use in my plugin) and then sends a request back to the shards to get
the stored fields (using ShardDoc.id, another field I don't have access to.)
Instead of doing all of that I'd like to just "peek" into the response that
is about to be written to the client.

I tried getting at rb.rsp but the data is not filled in during the last
stage (GET_FIELDS) that distributedProcess gets called for.





Re: RequestHandler that passes along the query

2008-10-04 Thread Brian Whitman
The issue I think is that process() is never called in my component, just
distributedProcess.
The server that hosts the component is a separate solr instance from the
shards, so my guess is process() is only called when that particular solr
instance has something to do with the index. distributedProcess() is called
for each of those stages, but the last stage it is called for is
GET_FIELDS.

But the WritingDistributedSearchComponents page did tip me off to a new
function, finishStage, that is called *after* each stage is done and does
exactly what I want:

  @Override
  public void finishStage(ResponseBuilder rb) {
    if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS) {
      SolrDocumentList sd = (SolrDocumentList) rb.rsp.getValues().get("response");
      for (SolrDocument d : sd) {
        rb.rsp.add("second-id-list", d.getFieldValue("id").toString());
      }
    }
  }






On Sat, Oct 4, 2008 at 1:37 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> I'm not totally on top of how distributed components work, but check:
> http://wiki.apache.org/solr/WritingDistributedSearchComponents
>
> and:
>  https://issues.apache.org/jira/browse/SOLR-680
>
> Do you want each of the shards to append values?  or just the final result?
>  If appending the values is not a big resource hog, it may make sense to
> only do that in the main "process" block.  If that is the case, I *think*
> you just implement: process(ResponseBuilder rb)
>
> ryan
>


maxCodeLen in the doublemetaphone solr analyzer

2008-11-13 Thread Brian Whitman
I want to change the maxCodeLen param that is in Solr 1.3's doublemetaphone
plugin. Doc is here:
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/DoubleMetaphone.html
Is this something I can do in solrconfig or do I need to change it and
recompile?


Re: maxCodeLen in the doublemetaphone solr analyzer

2008-11-13 Thread Brian Whitman
oh, thanks! I didn't see that patch.
On Thu, Nov 13, 2008 at 3:40 PM, Feak, Todd <[EMAIL PROTECTED]> wrote:

> There's a patch in to do that as a separate filter. See
> https://issues.apache.org/jira/browse/SOLR-813
>


matching exact terms

2008-11-25 Thread Brian Whitman
This is probably severe user error, but I am curious about how to index docs
to make this query work:

n_name:"happy birthday"

to return the doc with n_name:"Happy Birthday" before the doc with
n_name:"Happy Birthday, Happy Birthday". As it is now, the latter appears
first for a query of n_name:"happy birthday", the former second.

It would be great to do this at query time instead of having to re-index,
but I will if I have to!

The n_* type is defined as:

cannot allocate memory for snapshooter

2009-01-02 Thread Brian Whitman
I have an indexing machine on a test server (a mid-level EC2 instance, 8GB
of RAM) and I run jetty like:

java -server -Xms5g -Xmx5g -XX:MaxPermSize=128m
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap
-Dsolr.solr.home=/vol/solr -Djava.awt.headless=true -jar start.jar

The indexing master is set to snapshoot on commit. Sometimes (not always)
the snapshot fails with

SEVERE: java.io.IOException: Cannot run program "/vol/solr/bin/snapshooter":
java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(Unknown Source)

Why would snapshooter need more than 2GB ram?  /proc/meminfo says (with solr
running & nothing else)

MemTotal:  7872040 kB
MemFree:   2018404 kB
Buffers: 67704 kB
Cached:2161880 kB
SwapCached:  0 kB
Active:3446348 kB
Inactive:  2186964 kB
SwapTotal:   0 kB
SwapFree:0 kB
Dirty:   8 kB
Writeback:   0 kB
AnonPages: 3403728 kB
Mapped:  12016 kB
Slab:37804 kB
SReclaimable:20048 kB
SUnreclaim:  17756 kB
PageTables:   7476 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit:   3936020 kB
Committed_AS:  5383624 kB
VmallocTotal: 34359738367 kB
VmallocUsed:   340 kB
VmallocChunk: 34359738027 kB


debugging long commits

2009-01-02 Thread Brian Whitman
We have a distributed setup that has been experiencing glacially slow commit
times on only some of the shards. (10s on a good shard, 263s on a slow
shard.) Each shard for this index has about 10GB of lucene index data and
the documents are segregated by an md5 hash, so the distribution of
document/data types should be equal across all shards. I've turned off our
postcommit hooks to isolate the problem, so it's not a snapshot run amok or
anything. I also moved the indexes over to new machines and the same indexes
that were slow in production are also slow on the test machines.
During the slow commit, the jetty process is 100% CPU / 50% RAM on a 8GB
quad core machine. The slow commit happens every time after I add at least
one document. (If I don't add any documents the commit is immediate.)

What can I do to look into this problem?
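
For anyone digging into something similar: the dumps in the follow-up are the
kind of thing the stock JDK tools produce (a sketch; <pid> is the jetty
process id):

jstack <pid>          # thread dump
jmap -histo <pid>     # heap histogram, to see what is eating memory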


Re: debugging long commits

2009-01-02 Thread Brian Whitman
ng
on condition [0x..0x409303e0]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x2aabf9337400 nid=0x5da9 waiting
on condition [0x..0x408306b0]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x2aabf9314400 nid=0x5da8 in
Object.wait() [0x4072f000..0x4072faa0]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x2aaabeb86f50> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(Unknown Source)
- locked <0x2aaabeb86f50> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

"Reference Handler" daemon prio=10 tid=0x2aabf9312800 nid=0x5da7 in
Object.wait() [0x4062e000..0x4062ed20]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x2aaabeb86ec8> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
- locked <0x2aaabeb86ec8> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x2aabf917a000 nid=0x5da6 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x4011c800 nid=0x5da4
runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x4011e000 nid=0x5da5
runnable

"VM Periodic Task Thread" prio=10 tid=0x2aabf9342000 nid=0x5dad waiting
on condition

JNI global references: 971

Heap
 PSYoungGen  total 1395264K, used 965841K [0x2aab8d4b,
0x2aabf7f5, 0x2aabf7f5)
  eden space 1030400K, 93% used
[0x2aab8d4b,0x2aabc83d4788,0x2aabcc2f)
  from space 364864K, 0% used
[0x2aabe1b0,0x2aabe1b1,0x2aabf7f5)
  to   space 352320K, 0% used
[0x2aabcc2f,0x2aabcc2f,0x2aabe1b0)
 PSOldGentotal 3495296K, used 642758K [0x2aaab7f5,
0x2aab8d4b, 0x2aab8d4b)
  object space 3495296K, 18% used
[0x2aaab7f5,0x2aaadf301a78,0x2aab8d4b)
 PSPermGen   total 21248K, used 19258K [0x2ff5,
0x2aaab141, 0x2aaab7f5)
  object space 21248K, 90% used
[0x2ff5,0x2aaab121e8d8,0x2aaab141)

 num #instances #bytes  class name
--
   1:   6459678  491568792  [C
   2:   6456059  258242360  java.lang.String
   3:   6282264  251290560  org.apache.lucene.index.TermInfo
   4:   6282189  201030048  org.apache.lucene.index.Term
   5: 70220   39109632  [I
   6:  6082   25264288  [B
   7:   149   20355504  [J
   8:   133   20354208  [Lorg.apache.lucene.index.Term;
   9:   133   20354208  [Lorg.apache.lucene.index.TermInfo;
  10:1602308972880  java.nio.HeapByteBuffer
  11:1602188972208  java.nio.HeapCharBuffer
  12:1602108971760
 org.apache.lucene.index.FieldsReader$FieldForMerge
  13: 304404095480  
  14: 304403660128  
  15:  26053026184  
  16: 220653025120  [Ljava.lang.Object;
  17:  12972411792  [Ljava.util.HashMap$Entry;
  18: 486912309696  
  19:  26041981728  
  20:  21941889888  
  21: 274441317312  java.util.HashMap$Entry
  22: 24954 998160  java.util.AbstractList$Itr
  23: 18834 753360  org.apache.lucene.index.FieldInfo
  24:  2846 523664  java.lang.Class
  25: 13021 520840  java.util.ArrayList
  26: 12471 399072  org.apache.lucene.document.Document
  27:  3895 372216  [[I
  28:  3904 309592  [S
  29:   534 249632  
  30:  3451 220864
 org.apache.lucene.index.SegmentReader$Norm
  31:  1547 136136
 org.apache.lucene.store.FSDirectory$FSIndexInput
  32:   213 120984  
  33:   737 112024  java.lang.reflect.Method
  34:  1575 100800  java.lang.ref.Finalizer
  35:  1345  86080
 org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor
  36:  1188  76032  java.util.HashMap
...






Re: debugging long commits

2009-01-02 Thread Brian Whitman
I think I'm getting close with this (sorry for the self-replies)

I tried an optimize (which we never do) and it took 30m and said this a lot:

Exception in thread "Lucene Merge Thread #4"
org.apache.lucene.index.MergePolicy$MergeException:
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:314)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of
range: 34950
at org.apache.lucene.util.BitVector.get(BitVector.java:91)
at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:125)
at
org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
at
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:633)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:585)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)
Jan 2, 2009 6:05:49 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _ks4:C2504982
_oaw:C514635 _tll:C827949 _tdx:C18372 _te8:C19929 _tej:C22201 _1agw:C1717926
into _1agy [optimize]
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
at
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcesso
...


But then it finished. And now commits are OK again.

Anyone know what the merge hit exception means and if i lost anything?


Re: cannot allocate memory for snapshooter

2009-01-02 Thread Brian Whitman
Thanks for the pointer. (It seems really weird to alloc 5GB of swap just
because the JVM needs to run a shell script.. but I get hoss's explanation
in the following post)

On Fri, Jan 2, 2009 at 2:37 PM, Bill Au  wrote:

> add more swap space:
> http://www.nabble.com/Not-enough-space-to11423199.html#a11424938
>
> Bill
>


Re: cannot allocate memory for snapshooter

2009-01-05 Thread Brian Whitman
On Sun, Jan 4, 2009 at 9:47 PM, Mark Miller  wrote:

> Hey Brian, I didn't catch what OS you are using on EC2 by the way. I
> thought most UNIX OS's were using memory overcommit - A quick search brings
> up Linux, AIX, and HP-UX, and maybe even OSX?
>
> What are you running over there? EC2, so Linux I assume?
>

This is on debian, a 2.6.21 x86_64 kernel.


lazily loading search components?

2009-02-08 Thread Brian Whitman
We have a standard solr install that we use across a lot of different uses.
In that install is a custom search component that loads a lot of data in its
inform() method. This means the data is initialized on solr boot. Only about
half of our installs actually ever call this search component, so the data
sits around eating up heap.
I could start splitting up our conf/ folders per solr install "type" but
that seems wrong. I'd like to instead configure my search component to not
have its inform() called until the first time it is actually called. Is this
possible?
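
A sketch of the nearest existing knob I know of -- request handlers (not
search components) can be declared with startup="lazy" in solrconfig.xml,
which I believe defers construction until the first request (the handler
name here is an example):

<requestHandler name="/heavy" class="com.example.HeavyHandler" startup="lazy"/>

I don't believe there is an equivalent for searchComponents, hence the question.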


general survey of master/replica setups

2009-02-23 Thread Brian Whitman
Say you have a bunch of solr servers that index new data, and then some
replica/"slave" setup that snappulls from the master on a cron or some
schedule. Live internet facing queries hit the replica, not the master, as
indexes/commits on the master slow down queries.
But even the query-only solr installs need to "snap-install" every so often,
triggering a commit, and there is a slowdown in queries when this happens.
Measured avg QTimes are 400ms during normal operation, but during
commit/snapinstall they jump into the seconds. Say in the 5m between
snappulls 1000 documents have been updated/deleted/added.

How do people mitigate the effect of the commit on replica query instances?
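
One common mitigation is searcher warming, so the new searcher is only swapped
in once its caches are primed -- a sketch of the relevant solrconfig.xml
pieces (the warming query and counts are illustrative):

<query>
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">some popular query</str><str name="rows">10</str></lst>
    </arr>
  </listener>
</query>

Warming lengthens the commit itself, but the old searcher keeps serving until
the new one is warm, which steadies query latency.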


arcane queryParser parseException

2009-02-23 Thread Brian Whitman
server:/solr/select?q=field:"''anything can go here;" --> Lexical error,
encountered <EOF> after : "\"\'\'anything can go here"
server:/solr/select?q=field:"'anything' anything can go here;" --> Same
problem

server:/solr/select?q=field:"'anything' anything can go here\;" --> No
problem (but ClientUtils's escape does not escape semicolons.)

server:/solr/select?q=field:"anything can go here;" --> no problem

server:/solr/select?q=field:"''anything can go here" --> no problem

As far as I can tell, two apostrophes, then a semicolon causes the lexical
error. There can be text within the apostrophes. If you leave out the
semicolon it's ok. But you can keep the semicolon if you remove the two
apostrophes.

This is on trunk solr.
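
A sketch of the solrj-side guard in the meantime (escapeQueryChars is the real
ClientUtils method; whether it covers the semicolon depends on your version,
per the follow-up below):

String safe = ClientUtils.escapeQueryChars(userInput);
SolrQuery query = new SolrQuery("field:\"" + safe + "\"");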


Re: arcane queryParser parseException

2009-02-24 Thread Brian Whitman
>
> : I went ahead and added it since it does not hurt anything to escape more
> : things -- it just makes the final string ugly.
>
> : In 1.3 the escape method covered everything:
>
> Hmmm, good call, i didn't realize the escape method had been so
> blanket in 1.3.  this way we protect people who were using it in 1.3 and
> relied on it to protect them from the legacy ";" behavior.



Thanks hoss and ryan. That explains why the error was new to us -- we
upgraded to trunk from the 1.3 release and this exception came from a
solrj-processed query that used to work.


java.lang.NoSuchMethodError: org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(J)Ljava/util/Map;

2009-02-24 Thread Brian Whitman
Seeing this in the logs of an otherwise working solr instance. Commits are
done automatically I believe every 10m or 1 docs. This is solr trunk
(last updated last night) Any ideas?



INFO: [] webapp=/solr path=/select
params={fl=thingID,n_thingname,score&q=n_thingname:"Cornell+Dupree"^5+net_thingname:"Cornell+Dupree"^4+ne_thingname:"Cornell+Dupree"^2&wt=standard&fq=s_type:artist&rows=10&version=2.2}
hits=2 status=0 QTime=37
Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
commit{dir=/vol/solr/data/index,segFN=segments_2cy,version=1224560226691,generation=3058,filenames=[_2yp.tvf,
_2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, segments_2cy, _2yp.tii,
_2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf, _2yr.tii, _2yr.nrm, _2ys.tvd,
_2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd, _2yn.tii, _2yn.fdt, _2yq.prx,
_2yo.tvd, _2yp.fnm, _2yo.tvf, _2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx,
_2yp.prx, _2yn.tis, _2yq.nrm, _2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx,
_2yq.tis, _2yo.fdx, _2yo.tii, _2yp.nrm, _2yq.tii, _2yr.frq, _2yr.prx,
_2yo.tis, _2yp.fdt, _2yq.frq, _2yp.fdx, _2yq.fnm, _2yo.tvx, _2ys.tii,
_2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm, _2ys.tis, _2yr.tvd,
_2yn_9.del, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt, _2yp.tvd]
commit{dir=/vol/solr/data/index,segFN=segments_2cz,version=1224560226692,generation=3059,filenames=[_2yp.tvf,
_2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, _2yn_a.del, segments_2cz,
_2yt.tvf, _2yp.tii, _2yt.tvd, _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf,
_2yr.tii, _2yr.nrm, _2ys.tvd, _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd,
_2yn.tii, _2yn.fdt, _2yq.prx, _2yo.tvd, _2yt.tvx, _2yp.fnm, _2yo.tvf,
_2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, _2yp.prx, _2yn.tis, _2yq.nrm,
_2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, _2yq.tis, _2yo.fdx, _2yo.tii,
_2yp.nrm, _2yq.tii, _2yr.frq, _2yt.nrm, _2yr.prx, _2yo.tis, _2yp.fdt,
_2yq.frq, _2yt.fdx, _2yp.fdx, _2yt.fdt, _2yt.prx, _2yq.fnm, _2yo.tvx,
_2ys.tii, _2yt.fnm, _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm,
_2ys.tis, _2yt.tis, _2yr.tvd, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt,
_2yt.tii, _2yt.frq, _2yp.tvd]
Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy
updateCommits
INFO: last commit = 1224560226692
Feb 24, 2009 5:05:53 PM org.apache.solr.search.SolrIndexSearcher 
INFO: Opening searc...@25ddfb6a main
Feb 24, 2009 5:05:53 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=/vol/solr/data/index,segFN=segments_2cz,version=1224560226692,generation=3059,filenames=[_2yp.tvf,
_2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, _2yn_a.del, segments_2cz,
_2yt.tvf, _2yp.tii, _2yt.tvd, _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf,
_2yr.tii, _2yr.nrm, _2ys.tvd, _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd,
_2yn.tii, _2yn.fdt, _2yq.prx, _2yo.tvd, _2yt.tvx, _2yp.fnm, _2yo.tvf,
_2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, _2yp.prx, _2yn.tis, _2yq.nrm,
_2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, _2yq.tis, _2yo.fdx, _2yo.tii,
_2yp.nrm, _2yq.tii, _2yr.frq, _2yt.nrm, _2yr.prx, _2yo.tis, _2yp.fdt,
_2yq.frq, _2yt.fdx, _2yp.fdx, _2yt.fdt, _2yt.prx, _2yq.fnm, _2yo.tvx,
_2ys.tii, _2yt.fnm, _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm,
_2ys.tis, _2yt.tis, _2yr.tvd, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt,
_2yt.tii, _2yt.frq, _2yp.tvd]
Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy
updateCommits
INFO: last commit = 1224560226692
Feb 24, 2009 5:05:53 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NoSuchMethodError:
org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(J)Ljava/util/Map;
at org.apache.solr.search.FastLRUCache.getStatistics(FastLRUCache.java:244)
at org.apache.solr.search.FastLRUCache.toString(FastLRUCache.java:260)
at java.lang.String.valueOf(String.java:2827)
at java.lang.StringBuilder.append(StringBuilder.java:115)
at
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1645)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1147)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)

Feb 24, 2009 5:05:53 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to searc...@25ddfb6a main
Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={start=0&q=solr&rows=10} hits=0
status=0 QTime=2
Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null params={start=0&q=rocks&rows=10} hits=0
status=0 QTime=0
Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=null path=null
params={q=stat

Re: java.lang.NoSuchMethodError: org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(J)Ljava/util/Map;

2009-02-24 Thread Brian Whitman
Yep, did ant clean, made sure all the solr-libs were current, no more
exception. Thanks ryan & mark



On Tue, Feb 24, 2009 at 1:47 PM, Ryan McKinley  wrote:

> i hit that one too!
>
> try: ant clean
>
>

maxCodeLength in PhoneticFilterFactory

2009-04-10 Thread Brian Whitman
i have this version of solr running:

Solr Implementation Version: 1.4-dev 747554M - bwhitman - 2009-02-24
16:37:49

and am trying to update a schema to support 8 code length metaphone instead
of 4 via this (committed) issue:

https://issues.apache.org/jira/browse/SOLR-813

So I change the schema to this (knowing that I have to reindex)

<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
inject="true" maxCodeLength="8"/>

But when I do, queries fail with

Error initializing DoubleMetaphone: class org.apache.commons.codec.language.DoubleMetaphone
  at org.apache.solr.analysis.PhoneticFilterFactory.init(PhoneticFilterFactory.java:90)
  at org.apache.solr.schema.IndexSchema$6.init(IndexSchema.java:821)
  at org.apache.solr.schema.IndexSchema$6.init(IndexSchema.java:817)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:149)
  at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:831)
  at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:58)
  at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:425)
  at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:410)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:452)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:501)
  at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:121)


Re: maxCodeLength in PhoneticFilterFactory

2009-04-12 Thread Brian Whitman
yep, that did it. Thanks very much yonik.


On Sat, Apr 11, 2009 at 10:27 PM, Yonik Seeley
wrote:

> OK, should hopefully be fixed in trunk.
>
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Sat, Apr 11, 2009 at 9:16 PM, Yonik Seeley  wrote:
> > There's definitely a bug - I just reproduced it.  Nothing obvious
> > jumps out at me... and there's no error in the logs either (that's
> > another bug it would seem).  Could you open a JIRA issue for this?
> >
> >
> > -Yonik
> > http://www.lucidimagination.com
> >


python response handler treats "unschema'd" fields differently

2009-04-17 Thread Brian Whitman
I have a solr index where we removed a field from the schema but it still
had some documents with that field in it.
Queries using the standard response handler had no problem but the
&wt=python handler would break on any query (with fl="*" or asking for that
field directly) with:

SolrHTTPException: HTTP code=400, reason=undefined_field_oldfield

I "fixed" it by putting that field back in the schema.

One related weirdness is that fl=oldfield would cause the exception but not
fl=othernonschemafield -- that is, it would only break on field names that
were not in the schema but were in the documents.

I know this is undefined-behavior territory, but it was still weird that the
standard response writer does not do this -- if you give a nonexistent field
name to fl with wt=standard, whether it is in documents or not, it happily
performs the query, just skipping the fields that are not in the schema.


index time boosting on multivalued fields

2009-05-27 Thread Brian Whitman
I can set the boost of a field or doc at index time using the boost attr in
the update message, e.g.

<field name="foo" boost="2.0">pet</field>

But that won't work for multivalued fields according to the RelevancyFAQ:

<field name="foo" boost="2.0">pet</field>
<field name="foo" boost="3.0">animal</field>

( I assume it applies the last boost parsed to all terms? )

Now, say I'd like to do index-time boosting of a multivalued field with each
value having a unique boost. I could simply index the field multiple times:

<field name="foo">pet</field>
<field name="foo">pet</field>
<field name="foo">animal</field>

But is there a more exact way?


Re: Pagination of results and XSLT.

2007-07-23 Thread Brian Whitman
Has anyone tried to handle pagination of results using XSLT? I'm
not really sure it is possible to do it in pure XSLT because all
the response object gives us is a total document count - paginating
the results would involve more than what XSLT 1.0 could handle
(I'll be very happy if someone proves me wrong :)).




We do pagination in XSL 1.0 often -- direct from a solr response  
right to HTML/CSS/JS.
You get both the start and total rows from the solr response, so I  
don't know what else you'd need.


Here's a snip of a paging XSL in solr. The referred JS function  
pageResults just sets the &start= solr param.
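
A minimal sketch of that kind of pager (XSLT 1.0; the variable names and the
recursive template here are illustrative, not the original snippet):

<xsl:variable name="total" select="number(/response/result/@numFound)"/>
<xsl:variable name="startAt" select="number(/response/result/@start)"/>
<xsl:variable name="rows" select="10"/>

<xsl:template name="pager">
  <xsl:param name="page" select="0"/>
  <xsl:if test="$page * $rows &lt; $total">
    <!-- one page link; pageResults(start) just re-queries with &start= -->
    <a href="javascript:pageResults({$page * $rows})">
      <xsl:value-of select="$page + 1"/>
    </a>
    <xsl:call-template name="pager">
      <xsl:with-param name="page" select="$page + 1"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>

(The recursive call over pages is the part the follow-up below warns about.)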



Re: Pagination of results and XSLT.

2007-07-24 Thread Brian Whitman


On Jul 24, 2007, at 5:20 AM, Ard Schrijvers wrote:



I have been using similar xsls like you describe below in the past,
but I think after 3 years of using it I came to realize (500
internal server error) that it can lead to nasty errors when you
have a recursive call like that (though I am not sure whether it depends
on your xslt processor, at least xalan has the problem)


Yes -- in the public facing apps we have we limit the page counter to  
n + 10.


Not sure if this is a Solr thing to fix. I've been told many times
never to have solr go straight from xsl to html, so conceivably you'd
have a "real" web app in between that can easily do paging.

-b







Re: boost field without dismax

2007-07-24 Thread Brian Whitman

On Jul 24, 2007, at 9:42 AM, Alessandro Ferrucci wrote:

is there a way to boost a field much like is done in dismax request
handler?  I've tried doing index-time boosting by providing the
boost to the field as an attribute in the add doc but that did
nothing to affect the score when I went to search.  I do not want
to use dismax since I also want wildcard patterns supported.  What
I'd like to do is provide boosting of a last-name field when I do
a search.


something not unlike: firstname:alessandro lastname:ferrucci^5

?




Re: XML parsing error

2007-07-26 Thread Brian Whitman


On Jul 26, 2007, at 11:25 AM, Yonik Seeley wrote:


OK, then perhaps it's a jetty bug with charset handling.



I'm using resin btw



Could you run the same query, but use the python output?
wt=python



Seems to be OK:

{'responseHeader':{'status':0,'QTime':0,'params':{'start':'7','fl':'c
ontent','q':'"Pez"~1','rows':'1','wt':'python'}},'response':{'num
Found':5381,'start':7,'docs':[{'content':u'Akatsuki - PE\'Z \ufffd\uf
ffd\ufffd \ufffd\ufffd\ufffd \ufffd\ufffd\u04b3 | \ufffd\ufffd\ufffd\
ufffd\ufffd\ufffd\u0333 | \ufffd\u057a\ufffd\ufffd\ufffd\ufffd\ufffd
| \u0177\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd | >>> Akatsuki - PE\'Z \
ufffd\ufffd\ufffd \ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\u05e8\ufffd\u
fffd \ufffd|\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\u0438\ufffd  |\
ufffd \ufffd\ufffd\ufffd\ufffd\u016e\ufffd\ufffd  |\ufffd \ufffd\
u05b6\ufffd\ufffd\ufffd\ufffd  |\ufffd \ufffd\u057a\ufffd\ufffd\u
fffd\ufffd\ufffd  |\ufffd \ufffd\u00b8\ufffd\ufffd\ufffd\ufffd\uf
ffd  |\ufffd t\ufffd\u04fa\ufffd\ufffd\ufffd \ufffd\ufffd \ufffd\
ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd  |\ufffd \ufffd\ufffd\u
03f7\ufffd\ufffd\ufffd\ufffd  |\ufffd \u04f0\ufffd\ufffd\ufffd\uf
ffd\ufffd\ufffd  |\ufffd \ufffd\u03fc\ufffd\ufffd\ufffd\ufffd\uff
fd  |\ufffd \u0177\ufffd>\ufffd\ufffd\ufffd  |\ufffd \ufffd\u
03f8\ufffd\ufffd\ufffd\ufffd\ufffd  |\ufffd \ufffd\ufffd\u0475\uf
ffd\ufffd \u0177\ufffd>\ufffd\ufffd\ufffd > Various Artists[2005] >\u
fffd\ufffd Now Jazz 3 - That\'s What I Call Jazz \ufffd\ufffd> Akatsu
ki - PE\'Z \ufffd\ufffd\ufffd Akatsuki - PE\'Z \ufffd\ufffd\ufffd \uf
ffd \ufffd\ufffd \ufffd \ufffd\ufffd \ufffd \ufffd\ufffd \ufffd \ufff
d\ufffd\ufffd\ufffd\u05e8\ufffd\ufffd\ufffd\ufffd \ufffd\ufffdNow Jaz
z 3 - That\'s What I Call Jazz\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\u
fffd\u0773\ufffd\ufffd\ufffd\ufffd\u05a3\ufffd Various Artists[2005]
Akatsuki - PE\'Z \ufffd\ufffd\ufffd\ufffd\ufffd\u0231\ufffd\ufffd \uf
ffd\ufffd\ufffd\u01fb\u1fa1\ufffd\uccb9\ufffd\ufffd\ufffd\ufffd\u0231
\ufffd\u0138\ufffd\u02a3\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u
04b5\ufffd\ufffd\u02f8\u00f8\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\uff
fd\ufffd\ufffd\u04f8\u00f8\ufffd\ufffd>>> \ufffd\ufffd\ufffd\ufffd\uf
ffd\ufffd\u04bb\ufffd\ufffd\ufffd\ufffd\ud8e1'}]}}






Re: XML parsing error

2007-07-26 Thread Brian Whitman


On Jul 26, 2007, at 11:10 AM, Yonik Seeley wrote:



If the '<' truely got destroyed, it's a server (Solr or Jetty) bug.

One possibility is that the '<' does exist, but due to a charset
mismatch, it's being slurped into a multi-byte char.



Just dumped it with curl and did a hexdump:

5a0   t   ;   &   g   t   ;   &   g   t   ; 357 277 275 357 277
5b0 275 357 277 275 357 277 275 357 277 275 357 277 275 322 273 357
5c0 277 275 357 277 275 357 277 275 357 277 275 361 210 220 274   /
5d0   s   t   r   >   <   /   d   o   c   >   <   /   r   e   s   u
5e0   l   t   >  \n   <   /   r   e   s   p   o   n   s   e   >  \n
5f0


No < in the response.





XML parsing error

2007-07-26 Thread Brian Whitman

I ended up with this doc in solr:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">17</int><lst name="params"><str name="fl">content</str><str
name="q">"Pez"~1</str><str name="rows">1</str></lst></lst><result
numFound="5381" start="7"><doc><str name="content">Akatsuki - PE'Z
ҳ | ̳ | պ | ŷ | >>> Akatsuki - PE'Z   ר | и  |
Ů  | ֶ  | պ  | ¸  | tӺ
 | Ϸ  | Ӱ  | ϼ  | ŷ>
 | ϸ  | ѵ ŷ> > Various Artists[2005] > Now
Jazz 3 - That's What I Call Jazz > Akatsuki - PE'Z  Akatsuki -
PE'Z ר Now Jazz 3 - That's What I Call Jazz ݳ֣ Various
Artists[2005] Akatsuki - PE'Z ȱ ǻᾡ첹ȱĸʣ ҵ˸ø
Ӹø>>> һ񈐼/str></doc></result>
</response>

Note the missing < before the closing /str>.

Solrj throws this (on a larger query that includes this doc):
Caused by: javax.xml.stream.XMLStreamException: ParseError at
[row,col]:[3,20624]
Message: The element type "str" must be terminated by the matching
end-tag "</str>".


And firefox can't render it either, throws an error.

So any query that returns this doc will cause an error.

Obviously there's some weird stuff in this doc, but is it a solr  
issue that the < got destroyed?





Re: XML parsing error

2007-07-26 Thread Brian Whitman


On Jul 26, 2007, at 11:49 AM, Yonik Seeley wrote:



Could you try it with jetty to see if it's the servlet container?
It should be simple to just copy the index directory into solr's
example/solr/data directory.



Yonik, sorry for my delay, but I did just try this in jetty -- it
works (it doesn't throw an error, and the < in </str> is intact)




BTW, is the fact that the content is full of \uFFFD a problem?  That
looks to be the unicode replacement character, meaning that the real
characters were lost somewhere along the line?  Or is this some sort
of private (non-standard) encoding?


Certainly nothing I know about -- this particular index is from nutch  
crawls injected with solrj... so who knows.


I'll look into what I can with Resin's issue. For now I'm going to  
delete that doc and see if I can find any others.


-b



Re: Any clever ideas to inject into solr? Without http?

2007-08-09 Thread Brian Whitman


On Aug 9, 2007, at 11:12 AM, Kevin Holmes wrote:




2: Is there a way to inject into solr without using POST / curl /  
http?




Check http://wiki.apache.org/solr/EmbeddedSolr

There's examples in java and cocoa to use the DirectSolrConnection  
class, querying and updating solr w/o a web server. It uses JNI in  
the Cocoa case.

-b
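
For the curious, here's roughly what that looks like in plain Java (a  
sketch against the trunk API of the time; the paths are placeholders):

import org.apache.solr.servlet.DirectSolrConnection;

public class NoHttpExample {
  public static void main(String[] args) throws Exception {
    // instanceDir must contain conf/solrconfig.xml and conf/schema.xml
    DirectSolrConnection solr =
        new DirectSolrConnection("/path/to/solr", "/path/to/solr/data");

    // add a document and commit -- no servlet container involved
    solr.request("/update", "<add><doc><field name=\"id\">1</field></doc></add>");
    solr.request("/update", "<commit/>");

    // query; the raw response XML comes back as a String
    System.out.println(solr.request("/select?q=id:1", null));
  }
}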



Re: Python Utilitys for Solr

2007-08-14 Thread Brian Whitman


On Aug 14, 2007, at 5:16 AM, Christian Klinger wrote:


Hi

i just play a bit with:
http://svn.apache.org/repos/asf/lucene/solr/trunk/client/python/ 
solr.py


Is it possible that this library is a bit out of date?


If I try to get the example running, I get a parse error from the  
result. Maybe the response format from Solr has changed?




Yes, check this JIRA for some issues:

https://issues.apache.org/jira/browse/SOLR-216




Re: Indexing a URL

2007-09-05 Thread Brian Whitman


It is apparently attempting to parse &en=499af384a9ebd18f in the  
URL.  I am
not clear why it would do this as I specified indexed="false."  I  
need to

store this because that is how the user gets to the original article.


the ampersand is an XML reserved character. you have to escape it  
(turn it into &amp;), whether you are indexing the data or not.  
Nothing to do w/ Solr, just xml files in general. Whatever you're  
using to render the xml should be able to handle this for you.
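
For example, a stored URL carrying that parameter would have to appear  
in the update XML as (field name hypothetical):

<field name="url">http://example.com/article?page=1&amp;en=499af384a9ebd18f</field>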





Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Brian Whitman


On Sep 10, 2007, at 1:33 AM, Adrian Sutton wrote:
After a while we start getting exceptions thrown because of a  
timeout in acquiring write.lock. It's quite possible that this  
occurs whenever two updates are attempted at the same time - is  
DirectSolrConnection intended to be thread safe?




We use DirectSolrConnection via JNI in a couple of client apps that  
sometimes have 100s of thousands of new docs as fast as Solr will  
have them. It would crash relentlessly if I didn't force all calls to  
update or query to be on the same thread using objc's @synchronized  
and a message queue. I never narrowed down if this was a solr issue  
or a JNI one.
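
In Java terms, the workaround amounts to a single synchronized  
chokepoint in front of the connection -- a sketch of the idea, not what  
we actually shipped:

// every caller funnels through this one lock, serializing updates and queries
public class SerializedSolr {
  private final org.apache.solr.servlet.DirectSolrConnection solr;

  public SerializedSolr(org.apache.solr.servlet.DirectSolrConnection solr) {
    this.solr = solr;
  }

  public synchronized String request(String pathAndParams, String body) throws Exception {
    return solr.request(pathAndParams, body);
  }
}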








Re: DirectSolrConnection, write.lock and Too Many Open Files

2007-09-10 Thread Brian Whitman


On Sep 10, 2007, at 5:00 PM, Mike Klaas wrote:


On 10-Sep-07, at 1:50 PM, Adrian Sutton wrote:

We use DirectSolrConnection via JNI in a couple of client apps  
that sometimes have 100s of thousands of new docs as fast as Solr  
will have them. It would crash relentlessly if I didn't force all  
calls to update or query to be on the same thread using objc's  
@synchronized and a message queue. I never narrowed down if this  
was a solr issue or a JNI one.


That doesn't sound promising. I'll throw in synchronization around  
the update code and see what happens. That's doesn't seem good for  
performance though. Can Solr as a web app handle multiple updates  
at once or does it synchronize to avoid it?


Solr can handle multiple simultaneous updates.  The entire request  
processing is concurrent, as is the document analysis.  Only the  
final write is synchronized (this includes lucene segment merging).





Yes, i do want to disclaim that it's very likely my thread problems  
are an implementation detail w/ JNI, nothing to do w/ DSC.


-b




Re: Term extraction

2007-09-19 Thread Brian Whitman

On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:

I'm currently looking at methods of term extraction and automatic  
keyword

generation from indexed documents.


We do it manually (not in solr, but we put the results in solr.) We  
do it the usual way - chunk (into n-grams, named entities & noun  
phrases) and count (tf & df). It works well enough. There is a bevy  
of literature on the topic if you want to get "smart" -- but be  
warned smart and fast are likely not very good friends.


A lot depends on the provenance of your data -- is it clean text that  
uses a lot of domain specific terms? Is it webtext?
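
To make the chunk-and-count half concrete, a toy Java sketch (bigrams  
and tf only; named entities, noun phrases and df are left out):

import java.util.HashMap;
import java.util.Map;

public class BigramCounter {
  // term frequency of word bigrams in one document; accumulate the same
  // counts across documents to get df
  public static Map<String, Integer> bigramTf(String text) {
    String[] w = text.toLowerCase().split("\\W+"); // toy tokenizer
    Map<String, Integer> tf = new HashMap<String, Integer>();
    for (int i = 0; i + 1 < w.length; i++) {
      if (w[i].isEmpty()) continue;
      String gram = w[i] + " " + w[i + 1];
      Integer n = tf.get(gram);
      tf.put(gram, n == null ? 1 : n + 1);
    }
    return tf;
  }
}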




logging bad stuff separately in resin

2007-09-22 Thread Brian Whitman
We have a largish solr index that handles roughly 200K new docs a day  
and also roughly a million queries a day from other programs. It's  
hosted by resin.


A couple of times in the past few weeks something "bad" has happened  
-- a lock error or file handle error, or maybe a required field  
wasn't being sent by the indexer for some reason. We want to be able  
to know about this stuff asap without having to stare at the huge  
resin log all day.


Is there a way to filter the log that goes into resin by "bad/fatal"  
stuff separate from the usual request logging?  I would like to put  
the solr errors somewhere else so it's more maintainable.
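
(One angle, assuming Solr is logging through java.util.logging and that  
resin picks up a standard JDK logging.properties -- both assumptions  
worth verifying: give SEVERE records their own FileHandler, e.g.

handlers = java.util.logging.ConsoleHandler, java.util.logging.FileHandler
java.util.logging.ConsoleHandler.level = INFO
java.util.logging.FileHandler.level = SEVERE
java.util.logging.FileHandler.pattern = /var/log/solr-severe.%g.log
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

so the request chatter stays where it is and only the fatal stuff lands  
in the second file.)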






Re: Term extraction

2007-09-22 Thread Brian Whitman


On Sep 21, 2007, at 3:37 AM, Pieter Berkel wrote:


Thanks for the response guys:

Grant: I had a brief look at LingPipe, it looks quite interesting  
but I'm
concerned that the licensing may prevent me from using it in my  
project.




Does the opennlp license look good for you? It's LGPL. Not all the  
features of lingpipe but it works pretty well.  https:// 
sourceforge.net/projects/opennlp/





Re: Nutch with SOLR

2007-09-25 Thread Brian Whitman


Sami has a patch in there which used a older version of the solr  
client. with the current solr client in the SVN tree, his patch  
becomes much easier.
your job would be to upgrade the patch and mail it back to him so  
he can update his blog, or post it as a patch for inclusion in  
nutch/contrib (if sami is ok with that). If you have issues with  
how to use the solr client api, solr-user is here to help.




I've done this. Apparently someone else has taken on the solr-nutch  
job and made it a bit more complicated (which is good for the long  
term) than sami's original patch -- https://issues.apache.org/jira/ 
browse/NUTCH-442


But we still use a version of Sami's patch that works on both trunk  
nutch and trunk solr (solrj.) I sent my changes to sami when we did  
it, if you need it let me know...



-b




Re: Nutch with SOLR

2007-09-25 Thread Brian Whitman


But we still use a version of Sami's patch that works on both trunk  
nutch and trunk solr (solrj.) I sent my changes to sami when we did  
it, if you need it let me know...




I put my files up here: http://variogr.am/latest/?p=26

-b



Re: Nutch with SOLR

2007-09-26 Thread Brian Whitman


On Sep 26, 2007, at 4:04 AM, Doğacan Güney wrote:


NUTCH-442 is one of the issues that I want to really see resolved.
Unfortunately, I haven't received many (as in, none) comments, so I
haven't made further progress on it.



I am probably your target customer but to be honest all we care about  
is using Solr to index, not for any of the searching or summary stuff  
in Nutch. Is there a way to get Sami's SolrIndexer in nutch trunk  
(now that it's working OK) sooner than later and keep working on  
NUTCH-442 as well? Do they conflict? -b





searching for non-empty fields

2007-09-26 Thread Brian Whitman
I have a large index with a field for a URL. For some reason or  
another, sometimes a doc will get indexed with that field blank. This  
is fine but I want a query to return only the set URL fields...


If I do a query like:

q=URL:[* TO *]

I get a lot of empty fields back, like:

<str name="URL"/>
...
<str name="URL">http://thing.com</str>

What I can query for to remove the empty fields?





Re: searching for non-empty fields

2007-09-27 Thread Brian Whitman

thanks Peter, Hoss and Ryan..


q=(URL:[* TO *] -URL:"")


This gives me 400 Query parsing error: Cannot parse '(URL:[* TO *] - 
URL:"")': Lexical error at line 1, column 29. Encountered: "\"" (34),  
after : "\""




adding something like:
  


I'll do this but the problem here is I have to wait around for all  
these docs to re-index..


Your query will work if you make sure the URL field is omitted from  
the

document at index time when the field is blank.


The thing is, I thought I was omitting the field if it's blank. It's  
in a solrj instance that takes a lucenedocument, so maybe it's a  
solrj issue?


   if( URL != null && URL.length() > 5 )
  doc.add(new Field("URL", URL, Field.Store.YES,  
Field.Index.UN_TOKENIZED));


And then during indexing:

SimpleSolrDoc solrDoc = new SimpleSolrDoc();
solrDoc.setBoost( null, new Float( doc.getBoost()));
for (Enumeration e = doc.fields(); e.hasMoreElements();) {
  Field field = (Field) e.nextElement();
  if (!ignoreFields.contains(field.name())) {
    solrDoc.addField(field.name(), field.stringValue());
  }
}
try {
  solr.add(solrDoc);
...







small rsync index question

2007-09-28 Thread Brian Whitman
I'm not using snap* scripts but i quickly need to sync up two indexes  
on two machines. I am rsyncing the data dirs from A to B, which work  
fine. But how can I see the new index on B? For some reason sending a  
<commit/> is not refreshing the index, and I have to restart resin to  
see it. Is there something else I have to do?




Re: small rsync index question

2007-09-28 Thread Brian Whitman

On Sep 28, 2007, at 5:41 PM, Yonik Seeley wrote:


It should... are there any errors in the logs?  do you see the commit
in the logs?
Check the stats page to see info about when the current searcher was
last opened too.



ugh, nevermind.. was committing the wrong solr index... but Thanks  
yonik for the response


But luckily I can try save face with a followon question :)

I regularly see

file has vanished: "/dir/solr/data/index/segments_3aut"

when rsyncing, and when that happens i get an error on the rsync'd  
copy. The index I am rsyncing is large (50GB) and very active, is  
constantly getting new docs and searched on. What can I do to  
preserve the index state while syncing?








dismax downweighting

2007-10-12 Thread Brian Whitman
i have a dismax query where I want to boost appearance of the query  
terms in certain fields but "downboost" appearance in others.


The practical use is a field containing a lot of descriptive text and  
then a product name field where products might be named after a  
descriptive word. Consider an electric toothbrush called "The Fast  
And Thorough Toothbrush" -- if a user searches for fast toothbrush  
I'd like to down-weight that particular model's advantage. The name  
of the product might also be in the descriptive text.


I tried

<str name="qf">-name description</str>

but solr didn't like that.

Any better ideas?


--
http://variogr.am/





Lock obtain timed out

2007-10-18 Thread Brian Whitman
We have a very active large index running a solr trunk from a few  
weeks ago that has been going down about once a week for this:


[11:08:17.149] No lockType configured for /home/bwhitman/XXX/XXX/ 
discovered-solr/data/index assuming 'simple'
[11:08:17.150] org.apache.lucene.store.LockObtainFailedException:  
Lock obtain timed out: SimpleFSLock@/home/bwhitman/XXX/XXX/discovered- 
solr/data/index/lucene-5b07ebeb7d53a4ddc5a950a458af4acc-write.lock

[11:08:17.150]  at org.apache.lucene.store.Lock.obtain(Lock.java:70)
[11:08:17.150]  at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:598)
[11:08:17.150]  at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:410)
[11:08:17.150]  at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:97)
[11:08:17.150]  at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:121)
[11:08:17.150]  at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandl


We have the following in our solrconfig re: locks:

1000
1
false


What can I do to mitigate this problem? Removing the lock file and  
restarting resin solves it, but only temporarily.










Re: Lock obtain timed out

2007-10-18 Thread Brian Whitman

Thanks to ryan and matt.. so far so good.



<unlockOnStartup>true</unlockOnStartup>
<lockType>single</lockType>


grouped clause search in dismax

2007-10-20 Thread Brian Whitman
I have a dismax handler to match product names found in free text  
that looks like:


<requestHandler name="thing" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">name^5 nec_name^3 ne_name</str>
    <str name="fl">*</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>

name is type string, nec_name and ne_name are special types that do  
domain-specific stopword removal, latin1 munging etc, all are  
confirmed working fine on their own.


Say I have a product called "SUPERBOT" and I want the text "I love  
SUPERBOT" to match the product SUPERBOT pretty high.


In Lucene or Solr on its own you'd do something like:

name:(I love SUPERBOT)^5 nec_name:(I love SUPERBOT)^3 ne_name:(I love  
SUPERBOT)


which works fine.  And so does:

qt=thing&q=SUPERBOT

But this doesn't work:

qt=thing&q=(I%20love%20SUPERBOT)

nor does

qt=thing&q=I%20love%20SUPERBOT

-- they get no results.

How can we do "grouped clause" queries in dismax?







Re: How to get number of indexed documents?

2007-11-01 Thread Brian Whitman

does http://.../solr/admin/luke work for you?


<int name="numDocs">601818</int>

...


On Nov 1, 2007, at 10:39 PM, Papalagi Pakeha wrote:


Hello,

Is there any way to get XML version of statistics like how many
documents are indexed etc?

I have found http://.../solr/admin/properties which is cool but
doesn't give me the number of indexed documents.

Thanks

PaPa


--
http://variogr.am/





"overlapping onDeckSearchers" message

2007-11-03 Thread Brian Whitman
I have a solr index that hasn't had many problems recently but I had  
the logs open and noticed this a lot during indexing:


[16:23:34.086] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Not sure what it means, google didn't come back with much. 


Re: start.jar -Djetty.port= not working

2007-11-07 Thread Brian Whitman


On Nov 7, 2007, at 10:00 AM, Mike Davies wrote:

java -Djetty.port=8521 -jar start.jar

However when I run this it seems to ignore the command and still  
start on

the default port of 8983.  Any suggestions?



Are you using trunk solr or 1.2? I believe 1.2 still shipped with an  
older version of jetty that doesn't follow the new-style CL  
arguments. I just tried it on trunk and it worked fine for me.








--
http://variogr.am/
[EMAIL PROTECTED]





Re: start.jar -Djetty.port= not working

2007-11-07 Thread Brian Whitman



On Nov 7, 2007, at 10:07 AM, Mike Davies wrote:

I'm using 1.2, downloaded from

http://apache.rediris.es/lucene/solr/

Where can i get the trunk version?


svn, or http://people.apache.org/builds/lucene/solr/nightly/




Re: LSA Implementation

2007-11-26 Thread Brian Whitman


On Nov 26, 2007 6:58 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
patented, so it is not likely to happen unless the authors donate the
patent to the ASF.

-Grant




There are many ways to catch a bird... LSA reduces to SVD on the TF  
graph. I have had limited success using JAMA's SVD, which is PD. It's  
pure java; for something serious you'd want to wrap the hard bits in  
MKL/Accelerate.
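
A toy sketch of the JAMA route, for the curious (term-document counts  
in, reduced term space out; note JAMA's SVD wants rows >= columns):

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class LsaSketch {
  public static void main(String[] args) {
    // rows = terms, cols = documents (keep terms >= docs for JAMA)
    double[][] counts = {
        {2, 0, 1},
        {1, 1, 0},
        {0, 3, 1},
        {1, 0, 2},
    };
    SingularValueDecomposition svd = new Matrix(counts).svd();
    int k = 2; // keep the top-k "concepts"
    Matrix termSpace = svd.getU().getMatrix(0, counts.length - 1, 0, k - 1);
    System.out.println("top singular value: " + svd.getSingularValues()[0]);
    termSpace.print(8, 4); // each row is a term in the reduced space
  }
}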


A more interesting solr-related question is where a very heavy  
process like SVD would operate. You'd want to run the 'training' half  
of it separate from indexing or querying. It'd almost be like an  
optimize. Is there any hook right now to give Solr a "command" like  
<train/> and map it to a class in the solrconfig? The  
classify half of the SVD can happen at query or index time, very  
quickly; I imagine that could even be a custom field type.




Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Brian Whitman


On Nov 27, 2007, at 6:08 PM, bbrown wrote:

I couldn't tell if this was asked before.  But I want to perform a  
nutch crawl
without any solr plugin which will simply write to some index  
directory.  And
then ideally I would like to use solr for searching?  I am assuming  
this is

possible?



yes, this is quite possible. You need to have a solr schema that  
mimics the nutch schema, see sami's solrindexer for an example. Once  
you've got that schema, simply set the data dir in your solrconfig to  
the nutch index location and you'll be set.
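
Concretely, that's just the dataDir element in solrconfig.xml (path  
hypothetical). Keep in mind Solr expects the live Lucene index at  
{dataDir}/index, so if nutch left its merged index at crawl/index,  
point dataDir at the crawl directory:

<dataDir>/path/to/nutch/crawl</dataDir>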





Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Brian Whitman


On Nov 28, 2007, at 1:24 AM, Otis Gospodnetic wrote:

I only glanced at Sami's post recently and what I think I saw there  
is something different.  In other words, what Sami described is not  
a Solr instance pointing to a Nutch-built Lucene index, but rather  
an app that reads the appropriate Nutch/Hadoop files with fetched  
content and posts the read content to a Solr instance using a Solr  
java client like solrj.

No?



Yes, to be clear, all you need from Sami's thing is the schema file.  
Ignore everything else. Then point solr at the nutch index directory  
(it's just a lucene index.)


Sami's entire thing is for indexing with solr instead of nutch,  
separate issue...





Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Norberto Meijome <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Tuesday, November 27, 2007 8:33:18 PM
Subject: Re: Solr and nutch, for reading a nutch index

On Tue, 27 Nov 2007 18:12:13 -0500
Brian Whitman <[EMAIL PROTECTED]> wrote:



On Nov 27, 2007, at 6:08 PM, bbrown wrote:


I couldn't tell if this was asked before.  But I want to perform a



nutch crawl
without any solr plugin which will simply write to some index
directory.  And
then ideally I would like to use solr for searching?  I am assuming



this is
possible?



yes, this is quite possible. You need to have a solr schema that
mimics the nutch schema, see sami's solrindexer for an example. Once



you've got that schema, simply set the data dir in your solrconfig to



the nutch index location and you'll be set.


I think you should keep an eye on the versions of Lucene library used
by both Nutch + Solr - differences at this layer *could* make them
incompatible - but I am not an expert...
B

_
{Beto|Norberto|Numard} Meijome

"Against logic there is no armor like ignorance."
 Laurence J. Peter

I speak for myself, not my employer. Contents may be hot. Slippery  
when

wet. Reading disclaimers makes you go blind. Writing them is worse.
You have been Warned.





--
http://variogr.am/





can I do *thing* substring searches at all?

2007-11-29 Thread Brian Whitman
With a fieldtype of string, can I do any sort of *thing* search? I  
can do thing* but not *thing or *thing*. Workarounds?







Re: Re:

2007-12-02 Thread Brian Whitman


On Dec 2, 2007, at 5:43 PM, Ryan McKinley wrote:



try \& rather than %26



or just put quotes around the whole url. I think curl does the right  
thing here.





Re: RE: Re:

2007-12-02 Thread Brian Whitman


On Dec 2, 2007, at 6:00 PM, Andrew Nagy wrote:


On Dec 2, 2007, at 5:43 PM, Ryan McKinley wrote:



try \& rather than %26



or just put quotes around the whole url. I think curl does the  
right thing here.


I tried all the methods: converting & to %26, converting & to \& and  
encapsulating the url with quotes.  All give the same error.


curl http://localhost:8080/solr/update/csv?header=true\&seperator=%7C 
\&encapsulator=%22\&commit=true\&stream.file=import/homes.csv


seperator -> separator ? Does that help?




Re: RE: Re:

2007-12-02 Thread Brian Whitman


On Dec 2, 2007, at 5:29 PM, Andrew Nagy wrote:

Sorry for not explaining my self clearly: I have header=true as you  
can see from the curl command and there is a header line in the csv  
file.



was this your actual curl request?


curl 
http://localhost:8080/solr/update/csv?header=true%26seperator=%7C%26encapsulator=%22%26commit=true%26stream.file=import/homes.csv




you're escaping the ampersands if so... just keep them as &







Re: out of heap space, every day

2007-12-04 Thread Brian Whitman


For faceting and sorting, yes.  For normal search, no.



Interesting you mention that, because one of the other changes since  
last week besides the index growing is that we added a sort to an  
sint field on the queries.


Is it reasonable that a sint sort would require over 2.5GB of heap on  
an 8M-doc index? Is there any empirical data on how much RAM it will need?







out of heap space, every day

2007-12-04 Thread Brian Whitman
This maybe more of a general java q than a solr one, but I'm a bit  
confused.


We have a largish solr index, about 8M documents, the data dir is  
about 70G. We're getting about 500K new docs a week, as well as about  
1 query/second.


Recently (when we crossed about the 6M threshold) resin has been  
stopping with the following:


/usr/local/resin/log/stdout.log:[12:08:21.749] [28304] HTTP/1.1 500  
Java heap space
/usr/local/resin/log/stdout.log:[12:08:21.749]  
java.lang.OutOfMemoryError: Java heap space


Only a restart of resin will get it going again, and then it'll crash  
again within 24 hours.


It's a 4GB machine and we run it with args="-J-mx2500m -J-ms2000m" We  
can't really raise this any higher on the machine.


Are there 'native' memory requirements for solr as a function of  
index size? Does a 70GB index require some minimum amount of wired  
RAM? Or is there some mis-configuration w/ resin or solr or my  
system? I don't really know Java well but it seems strange that the  
VM can't page RAM out to disk or really do something else beside  
stopping the server.












Re: out of heap space, every day

2007-12-04 Thread Brian Whitman


int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms.
Then double that to allow for a warming searcher.



This is great, but can you help me parse this? Assume 8M docs and I'm  
sorting on an int field that is unix time (seconds since epoch.) For  
the purposes of the experiment assume every doc was indexed at a  
unique time.


so..

(int[8M] + String[8M], each term is 16 chars + 8M*4) * 2

that's 384MB by my calculation. Is that right?
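
Spelling that out (counting 4 bytes per int and per String reference,  
and ~16 bytes of term data apiece): 8M*4 + 8M*4 + 8M*16 = 192MB,  
doubled for the warming searcher = 384MB. With Java's 2-byte chars the  
16-char terms are really ~32 bytes each, which pushes it closer to  
(32MB + 32MB + 256MB) * 2 = 640MB.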




solrj - adding a SolrDocument (not a SolrInputDocument)

2007-12-06 Thread Brian Whitman

Writing a utility in java to do a copy from one solr index to another.
I query for the documents I want to copy:

SolrQuery q = new SolrQuery();
q.setQuery("dogs");
QueryResponse rq = source_solrserver.query(q);
for( SolrDocument d : rq.getResults() ) {
// now I want to add these to a new server after modifying it slightly
d.addField("newField", "somedata");
dest_solrserver.add(d);
}

but that doesn't work-- add wants a SolrInputDocument, not a  
SolrDocument. I can't cast or otherwise easily create the former from  
the latter. How could I do this sort of thing?







Re: solrj - adding a SolrDocument (not a SolrInputDocument)

2007-12-06 Thread Brian Whitman

On Dec 6, 2007, at 3:07 PM, Ryan McKinley wrote:
  public static SolrInputDocument toSolrInputDocument( SolrDocument d )
  {
    SolrInputDocument doc = new SolrInputDocument();
    for( String name : d.getFieldNames() ) {
      doc.addField( name, d.getFieldValue(name), 1.0f );
    }
    return doc;
  }

thanks, that worked! agree it's useful to have in clientutils...  
though I'm not sure why there needs to be two separate classes to  
begin with.





Re: Solr and Flex

2007-12-13 Thread Brian Whitman

On Dec 13, 2007, at 10:42 AM, jenix wrote:



I'm using Flex for the frontend interface and Solr on backend for  
the search
engine. I'm new to Flex and Flash and thought someone might have  
some code

integrating the two.



We've done light stuff querying solr w/ actionscript. It is pretty  
simple, you form your query as a url, get the url and then use AS's  
built in xml parser to get whatever you need. Haven't tried posting  
documents.







Re: debugging slowness

2007-12-20 Thread Brian Whitman


On Dec 20, 2007, at 11:02 AM, Otis Gospodnetic wrote:

Sounds like GC to me.  That is, the JVM not having large enough  
heap.  Run jconsole and you'll quickly see if this guess is correct  
or not (kill -QUIT is also your friend, believe it or not).


We recently had somebody who had a nice little Solr spellchecker  
instance running, but after awhile it would "stop responding".  We  
looked at the command-line used to invoke the servlet container and  
didn't see -Xmx. :)]


I'm giving resin args="-J-mx10000m -J-ms5000m" (this is an Amazon  
extra-large instance w/ 16GB), and it's using it:


 PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
2738 root  18   0 10.4g 9.9g 9756 S  231 66.0  48:07.66 java

After a restart yesterday and normal operation we haven't seen the  
problem creep back in yet. I might get my perl on to graph the query  
time and see if it's steadily increasing.


Can't run jconsole, no X at the moment, if need be I'll install it  
though...






Re: Status 500 - ParseError at [row,col]:[1,1] Message Content is not allowed in Prolog

2008-01-08 Thread Brian Whitman


On Jan 8, 2008, at 10:58 AM, Kirk Beers wrote:



curl http://localhost:8080/solr/update -H "Content-Type:text/xml" --data-binary '/<add overwritePending="true"><doc><field name="id">0001</field><field name="name">Title</field><field name="text">It was the best of times it was the worst of times blah blah blah</field></doc></add>'



Why the / after the first single quote?




Re: Status 500 - ParseError at [row,col]:[1,1] Message Content is not allowed in Prolog

2008-01-08 Thread Brian Whitman


I found that on the Wiki at http://wiki.apache.org/solr/UpdateXmlMessages#head-3dfbf90fbc69f168ab6f3389daf68571ad614bef 
  under the title: Updating a Data Record via curl. I removed it and  
now have the following:




<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">122</int></lst><str name="WARNING">This response format is experimental.  It is likely to change in the future.</str></response>





Seems to be an error in the wiki. I changed it. Commit and you should  
see your test document in queries.





index out of disk space, CorruptIndexException

2008-01-14 Thread Brian Whitman
We had an index run out of disk space. Queries work fine but commits  
return


500 doc counts differ for segment _18lu: fieldsReader shows 104  
but segmentInfo shows 212


org.apache.lucene.index.CorruptIndexException: doc counts differ for  
segment _18lu: fieldsReader shows 104 but segmentInfo shows 212
	at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:191)


I've made room, restarted resin, and now solr won't start. No useful  
messages in the startup, just a


[21:01:49.105] Could not start SOLR. Check solr/home property
[21:01:49.105] java.lang.NullPointerException
[21:01:49.105]  at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:100)


What can I do from here?







Re: index out of disk space, CorruptIndexException

2008-01-14 Thread Brian Whitman


On Jan 14, 2008, at 4:08 PM, Ryan McKinley wrote:

ug -- maybe someone else has better ideas, but you can try:
http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/index/CheckIndex.java


thanks for the tip, i did run that, but I stopped it 30 minutes in, as  
it was still on the first (out of 46) segment.. The index is (was)  
129GB.

I just restored to an older index and made this ticket, 
https://issues.apache.org/jira/browse/SOLR-455





Re: Missing Content Stream

2008-01-15 Thread Brian Whitman


On Jan 15, 2008, at 1:50 PM, Ismail Siddiqui wrote:


Hi Everyone,
I am new to solr. I am trying to index xml using http post as follows



Ismail, you seem to have a few spelling mistakes in your xml string.  
"fiehld, nadme" etc. (a) try fixing them, (b) try solrj instead, I  
agree w/ otis.






Re: best way to get number of documents in a Solr index

2008-01-15 Thread Brian Whitman


On Jan 15, 2008, at 3:47 PM, Maria Mosolova wrote:


Hello,

I am looking for the best way to get the number of documents in a  
Solr index. I'd like to do it from a java code using solrj.



  public int resultCount() {
try {
  SolrQuery q = new SolrQuery("*:*");
  QueryResponse rq = solr.query(q);
      return (int) rq.getResults().getNumFound();
} catch (org.apache.solr.client.solrj.SolrServerException e) {
  System.err.println("Query problem");
} catch (java.io.IOException e)  {
  System.err.println("Other error");
}
return -1;
  }




Re: Newbie with Java + typo

2008-01-21 Thread Brian Whitman


On Jan 21, 2008, at 11:13 AM, Daniel Andersson wrote:
Well, no. "Immutable Page", and as far as I know (english not being  
my mother tongue), that means I can't edit the page



You need to create an account first.


Re: SolrPhpClient with example jetty

2008-01-22 Thread Brian Whitman


$document->title = 'Some Title';
	$document->content = 'Some content for this wonderful document.  
Blah blah blah.';




did you change the schema? There's no title or content field in the  
default example schema. But I believe solr does output different  
errors for that.





Re: Cache size clarification

2008-01-28 Thread Brian Whitman


On Jan 28, 2008, at 6:05 PM, Alex Benjamen wrote:
I need some clarification on the cache size parameters in the  
solrconfig. Suppose I'm using these values:



A lot of this is here: http://wiki.apache.org/solr/SolrCaching



Re: SEVERE: java.lang.OutOfMemoryError: Java heap space

2008-01-28 Thread Brian Whitman

On Jan 28, 2008, at 7:06 PM, Leonardo Santagada wrote:


On 28/01/2008, at 20:44, Alex Benjamen wrote:




I could allocate more physical memory, but I can't seem to increase  
the -Xmx option to 3800 I get
an error : "Could not reserve enough space for object heap", even  
though I have more than 4Gb free.
(We're running on Intel quad core 64bit) When I try strace I'm  
seeing mmap2 errors.


I don't know much about java... but can you get any program to map  
more than 4gb of memory? I know windows has hard limits on how much  
memory you can map to one process and linux I think has some limit  
too. Of course it can be configured but maybe it is just a system  
configuration problem.


We use 10GB of ram in one of our solr installs. You need to make sure  
your java is 64 bit though.  Alex, what does your java -version show?  
Mine shows


java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_03-b05, mixed mode)

And I run it with -mx10000m -ms5000m



Re: SEVERE: java.lang.OutOfMemoryError: Java heap space

2008-01-28 Thread Brian Whitman


But on Intel, where I'm having the problem it  shows:
java version "1.6.0_10-ea"
Java(TM) SE Runtime Environment (build 1.6.0_10-ea-b10)
Java HotSpot(TM) Server VM (build 11.0-b09, mixed mode)

I can't seem to find the Intel 64 bit JDK binary, can you pls. send  
me the link?

I was downloading from here:
http://download.java.net/jdk6/




Install the AMD64 version. (Confusingly, "AMD64" names the 64-bit  
spec that both AMD and Intel chips now implement; Intel's term for it is EM64T.)
If that still doesn't work, is it possible that your machine/kernel is  
not set up to support 64 bit?




date math syntax

2008-01-29 Thread Brian Whitman
Is there a wiki page or more examples of the "date math" parsing other  
than this:


http://www.mail-archive.com/solr-user@lucene.apache.org/msg01563.html

out there somewhere? From an end user query perspective. -b
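
(For later readers: the gist of DateMathParser, as I understand it --  
offsets and '/' rounding applied left to right against NOW or any date:

NOW/DAY -- midnight today
NOW-1DAY -- this time yesterday
NOW/DAY+6MONTHS+3DAYS -- round first, then add
timestamp:[NOW/DAY-7DAYS TO NOW] -- e.g. everything from the last week

-- treat these as a sketch rather than documentation.)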



Re: Converting Solr results to java query/collection/map object

2008-02-19 Thread Brian Whitman

On Feb 19, 2008, at 3:08 PM, Paul Treszczotko wrote:


Hi,
I'm pretty new to SOLR and I'd like to ask your opinion on the best  
practice for converting XML results you get from SOLR into something  
that is better fit to display on a webpage. I'm looking for  
performance and relatively small footprint, perhaps ability to  
paginate thru the result set and display/process N results at a  
time. Any ideas? Any tutorials you can point me to? Thanks!




Paul, this is what solrj is for.

SolrQuery q = new SolrQuery();
q.setRows(10);
q.setStart(40);
q.setQuery("type:dogs");
QueryResponse rq = solrServer.query(q);
for( SolrDocument d : rq.getResults() ) {
String dogname = (String)d.getFieldValue("name");
...
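
To paginate, the same SolrQuery can be reused by stepping the start  
offset -- a quick sketch (getNumFound() returns a long):

long total = rq.getResults().getNumFound();
for (int start = 10; start < total; start += 10) {
  q.setStart(start);
  rq = solrServer.query(q);
  // hand the next 10 SolrDocuments to the page renderer...
}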



will hardlinks work across partitions?

2008-02-23 Thread Brian Whitman
Will the hardlink snapshot scheme work across physical disk  
partitions? Can I snapshoot to a different partition than the one  
holding the live solr index?


can I form a SolrQuery and query a SolrServer in a request handler?

2008-02-25 Thread Brian Whitman
I'm in a request handler: public void  
handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)  {


And in here i want to form a SolrQuery based on the req, query the  
searcher and return results.


But how do I get a SolrServer out of the req? I can get a  
SolrIndexSearcher but that doesn't seem to let me pass in a SolrQuery.


I need a SolrQuery because I am forming a dismax query with a function  
query, etc...





Re: can I form a SolrQuery and query a SolrServer in a request handler?

2008-02-25 Thread Brian Whitman


Perhaps back up and see if we can do this a simpler way than a request
handler...
What is the query structure you are trying to generate?



I have two dismax queries defined in a solrconfig. Something like

<requestHandler name="q1" class="solr.DisMaxRequestHandler">
...
  <str name="qf">raw^4 name^1</str>
</requestHandler>

<requestHandler name="q2" class="solr.DisMaxRequestHandler">
...
  <str name="qf">tags^3 type^2</str>
</requestHandler>

They work fine on their own, and we often use &bf=sortable^... to  
change the ordering. But we want to merge them. Result IDs that show  
up in both need to go higher and with a url param we need to weight  
between the two. So I am making a /combined requesthandler that takes  
the query, the weights between the two and the value of the  
bf=sortable boost.


My handler: /combined?q=kittens&q1=0.5&q2=0.8&bfboost=2.0

Would query ?qt=q1&q=kittens&bf=2&fl=id, then ? 
qt=q2&q=kittens&bf=2&fl=id. The request handler would return the  
results of a term query with the (q1 returned IDs)^0.5 (q2 returned  
IDs)^0.8.









Re: can I form a SolrQuery and query a SolrServer in a request handler?

2008-02-25 Thread Brian Whitman
Would query ?qt=q1&q=kittens&bf=2&fl=id, then ? 
qt=q2&q=kittens&bf=2&fl=id.


Sorry, I meant:

 ?qt=q1&q=kittens&bf=sortable^2&fl=id, then ? 
qt=q2&q=kittens&bf=sortable^2&fl=id




invalid XML character

2008-03-01 Thread Brian Whitman

Once in a while we get this

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[4,790470]
[14:32:21.877] Message: An invalid XML character (Unicode: 0x6) was  
found in the element content of the document.
[14:32:21.877] 	at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
[14:32:21.877] 	at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
[14:32:21.877] 	at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)

...

Our data comes from all sorts of places and although we've tried to be  
utf8 wherever we can, there are still cracks.


I would much rather a document get added with a replacement character  
than have this error prevent the addition of 8K documents (as  
happened here: this one character was in an <add> batch of 8K  
documents, and only the docs before it were added.)


Is there something I can do on the solr side to ignore/replace invalid  
characters?
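
One client-side option is to scrub the text before it ever reaches the  
<add> message. A minimal Java sketch against the XML 1.0 valid ranges  
(note this naive version also replaces surrogate pairs, so characters  
outside the BMP are lost):

// replace characters that XML 1.0 forbids with U+FFFD
public static String stripInvalidXml(String in) {
  StringBuilder out = new StringBuilder(in.length());
  for (int i = 0; i < in.length(); i++) {
    char c = in.charAt(i);
    boolean ok = c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD);
    out.append(ok ? c : '\uFFFD');
  }
  return out.toString();
}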







