Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Liam O'Boyle
Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.
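
For illustration, a minimal SolrJ sketch of that read-modify-reindex cycle
(the uniqueKey field "id" and the field being changed, "category", are just
example names) might look like:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class UpdateOneField {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Fetch the existing document by its unique key (assumes it exists).
        SolrDocument old = server.query(new SolrQuery("id:doc42")).getResults().get(0);

        // Copy every stored field, then overwrite the one we want to change.
        SolrInputDocument doc = new SolrInputDocument();
        for (String name : old.getFieldNames()) {
            doc.addField(name, old.getFieldValue(name));
        }
        doc.setField("category", "updated-value");

        // Re-adding a document with the same unique key replaces the old one.
        server.add(doc);
        server.commit();
    }
}

Note that stored copyField targets would be copied again on the re-add, so you
may want to skip those fields when rebuilding the document.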

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after 
> the
> document was already indexed, and I have only the document id.  How do I do
> that?
>
> Thanks.
>
>
>


Re: Distance sorting with spatial filtering

2010-09-10 Thread Lance Norskog
Since no one has jumped in to give the right syntax- yeah, it's a bug.
Please file a JIRA.

On Thu, Sep 9, 2010 at 9:44 PM, Scott K  wrote:
> On Thu, Sep 9, 2010 at 21:00, Lance Norskog  wrote:
>> I just checked out the trunk, and branch 3.x This query is accepted on both,
>> but gives no responses:
>> http://localhost:8983/solr/select/?q=*:*&sort=dist(2,x_dt,y_dt,0,0)+asc
>
> So you are saying when you add the sort parameter you get no results
> back, but do not get the error I am seeing? Should I open a Jira
> ticket?
>
>> x_dt and y_dt are wildcard fields with the tdouble type. "tdouble"
>> explicitly says it is stored and indexed. Your 'longitude' and 'latitude'
>> fields may not be stored?
>
> No, they are stored.
> http://localhost:8983/solr/select?q=*:*&rows=1&wt=xml&indent=true
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader">
>  <int name="status">0</int>
>  <int name="QTime">9</int>
> </lst>
> <result name="response" numFound="..." start="0">
>  <doc>
> ...
>    <double name="latitude">47.6636</double>
>    <double name="longitude">-122.3054</double>
>
>
>> Also, this is accepted on both branches:
>> http://localhost:8983/solr/select/?q=*:*&sort=sum(1)+asc
>>
>> The documentation for sum() does not mention single-argument calls.
>
> This also fails
> http://localhost:8983/solr/select/?q=*:*&sort=sum(1,2)+asc
> http://localhost:8983/solr/select/?q=*:*&sort=sum(latitude,longitude)+asc
>
>
>> Scott K wrote:
>>>
>>> According to the documentation, sorting by function has been a feature
>>> since Solr 1.5. It seems like a major regression if this no longer
>>> works.
>>> http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
>>>
>>> The _val_ trick does not seem to work if used with a query term,
>>> although I can try some more things to give 0 value to the query term.
>>>
>>> On Wed, Sep 8, 2010 at 22:21, Lance Norskog  wrote:
>>>

 It says that the field "sum(1)" is not indexed. You don't have a field
 called 'sum(1)'. I know there have been a lot of changes in query parsing,
 and sorting by functions may be on the list. But the _val_ trick is the
 older one and, as you noted, still works. The _val_ trick sets the
 ranking value to the output of the function, thus indirectly doing what
 sort= does.

 Lance

 Scott K wrote:

>
> I get the error on all functions.
> GET 'http://localhost:8983/solr/select?q=*:*&sort=sum(1)+asc'
> Error 400 can not sort on unindexed field: sum(1)
>
> I tried another nightly build from today, Sep 7th, with the same
> results. I attached the schema.xml
>
> Thanks for the help!
> Scott
>
> On Wed, Sep 1, 2010 at 18:43, Lance Norskog    wrote:
>
>
>>
>> Post your schema.
>>
>> On Mon, Aug 30, 2010 at 2:04 PM, Scott K    wrote:
>>
>>
>>>
>>> The new spatial filtering (SOLR-1586) works great and is much faster
>>> than fq={!frange. However, I am having problems sorting by distance.
>>> If I try
>>> GET
>>>
>>> 'http://localhost:8983/solr/select/?q=*:*&sort=dist(2,latitude,longitude,0,0)+asc'
>>> I get an error:
>>> Error 400 can not sort on unindexed field:
>>> dist(2,latitude,longitude,0,0)
>>>
>>> I was able to work around this with
>>> GET 'http://localhost:8983/solr/select/?q=*:* AND _val_:"recip(dist(2,
>>> latitude, longitude, 0,0),1,1,1)"&fl=*,score'
>>>
>>> But why isn't sorting by functions working? I get this error with any
>>> function I try to sort on. This is a nightly trunk build from Aug 25th.
>>> I see SOLR-1297 was reopened, but that seems to be for edge cases.
>>>
>>> Second question: I am using the LatLonType from the Spatial Filtering
>>> wiki, http://wiki.apache.org/solr/SpatialSearch
>>> Are there any distance sorting functions that use this field, or do I
>>> need to have three indexed fields, store_lat_lon, latitude, and
>>> longitude, if I want both filtering and sorting by distance.
>>>
>>> Thanks, Scott
>>>
>>>
>>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>>
>>


>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Date faceting +1MONTH problem

2010-09-10 Thread Jan Høydahl / Cominvent
Just attended a talk at JavaZone (www.javazone.no) by Stephen Colebourne about
JSR-310, which will make these kinds of operations easier in a future JDK, and how
Joda-Time goes a long way toward enabling it today. I'm not saying it would fix
your GAP issue, as it's all about what definition of "month" we want to have in
Solr, which may also vary between use cases. For faceting it makes perfect sense
to follow calendar months.

Wouldn't this behaviour also occur between 31-day and 30-day months, e.g.
2009-08-31 -> 2009-09-30 -> 2009-10-30?

For your particular use case I would consider a workaround where you add a new
year-month field for this kind of faceting, avoiding date math completely since
it will be a string facet, giving you:

[2008-12] => 0
[2009-01] => 0
[2009-02] => 0
[2009-03] => 0
[2009-04] => 0
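
For illustration, a minimal SolrJ sketch of filling such a year-month field at
indexing time (the field names "event_date" and "month_s" are just examples)
could be:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class YearMonthIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        Date eventDate = new Date();                      // the date you would otherwise facet on
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));     // match Solr's UTC date handling

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("event_date", eventDate);
        doc.addField("month_s", fmt.format(eventDate));   // e.g. "2009-02"
        server.add(doc);
        server.commit();
    }
}

Faceting on month_s (facet.field=month_s) then gives exact calendar-month
buckets with no date math involved.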

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 10. sep. 2010, at 01.20, Liam O'Boyle wrote:

> Hi Chris,
> 
> Yes, I saw the facet.range.include feature and briefly tried to implement it
> before realising that it was Solr 3.1 only :)  I agree that it seems like
> the best solution to the problem.
> 
> Reindexing with a +1MILLI hack had occurred to me and I guess that's what
> I'll do in the meantime; it just seemed like something that people must have
> run into before!  I suppose it depends on the granularity of your
> timestamps; all of my values are actually just dates, so I've been putting
> them in as the date with T00:00:00.000Z, which makes the overlap problem
> very obvious.
> 
> If anyone else has come across a solution for this, feel free to suggest
> another approach, otherwise it's reindexing time.
> 
> Cheers,
> Liam
> 
> 
> On Fri, Sep 10, 2010 at 8:38 AM, Chris Hostetter
> wrote:
> 
>> : I'm trying to break down the data over a year into facets by month; to
>> avoid
>> : overlap, I'm using -1MILLI on the start and end dates and using a gap of
>> : +1MONTH.
>> :
>> : However, it seems like February completely breaks my monthly cycles,
>> leading
>> 
>> Yep.
>> 
>> Everything you posted makes sense to me in how DateMath works - "Jan 31 @
>> 23:59.999" + "1 MONTH" results in "Feb 28 @ 23:59.999" ... at which point
>> adding "1 MONTH" to that results in "Mar 28 @ ..." because there is no
>> context of what the initial starting point was.
>> 
>> It's not a situation I've ever personally run into ... one workaround
>> would be to use a "+1MILLI" fudge factor at indexing time, instead of a
>> "-1MILLI" fudge factor at query time ... that shouldn't have this problem.
>> 
>> If you'd like to open a bug to track this, I think it might be possible to
>> fix this behavior (there are some features in the Java calendaring code
>> that make things like "Jan 31 + 2 Months" do the right thing) but
>> personally I think working on SOLR-1896 (combined with the new
>> facet.range.include param) is a more effective use of time so
>> we can eliminate the need for this type of hack completely in future Solr
>> releases.
>> 
>> -Hoss
>> 
>> --
>> http://lucenerevolution.org/  ...  October 7-8, Boston
>> http://bit.ly/stump-hoss  ...  Stump The Chump!
>> 
>> 



Solr CoreAdmin create ignores dataDir Parameter

2010-09-10 Thread Frank Wesemann

Hello,
if I am trying to create a new SolrCore based on an existing one via 
the CoreAdmin HTTP API,


http://localhost:8983/solr/admin/cores?action=CREATE&name=newCore&instanceDir=old_instance&schema=newSchema.xml&dataDir=newdata 



the dataDir parameter is ignored.
Instead, the dataDir from the solrconfig.xml is taken into account.

I had a look at the sources and saw that the CoreContainer's create()
method calls the SolrCore constructor with a dataDir value of "null", which
leads to the dataDir primarily being read from the config and not from the
CoreDescriptor.


Shouldn't the CoreDescriptor, being more specific, take precedence over 
the config?


--
mit freundlichem Gruß,

Frank Wesemann
Fotofinder GmbH USt-IdNr. DE812854514
Software Entwicklung    Web: http://www.fotofinder.com/
Potsdamer Str. 96   Tel: +49 30 25 79 28 90
10785 Berlin    Fax: +49 30 25 79 28 999

Sitz: Berlin
Amtsgericht Berlin Charlottenburg (HRB 73099)
Geschäftsführer: Ali Paczensky





Re: Solr CoreAdmin create ignores dataDir Parameter

2010-09-10 Thread Mark Miller
On 9/10/10 7:00 AM, Frank Wesemann wrote:
> Hello,
> if I am trying to create a new SolrCore based on an existing one via
> the CoreAdmin HTTP API,
> 
> http://localhost:8983/solr/admin/cores?action=CREATE&name=newCore&instanceDir=old_instance&schema=newSchema.xml&dataDir=newdata
> 
> 
> 
> the dataDir parameter is ignored.
> Instead, the dataDir from the solrconfig.xml is taken into account.
> 
> I had a look at the sources and saw that the CoreContainer's create()
> method calls the SolrCore constructor with a dataDir value of "null", which
> leads to the dataDir primarily being read from the config and not from the
> CoreDescriptor.
> 
> Shouldn't the CoreDescriptor, being more specific, take precedence over
> the config?
> 


I think so - what version of Solr are you using? I believe I changed
this on trunk a few months ago.

- Mark


How to delete documents from a SOLR cloud / balance the shards in the cloud?

2010-09-10 Thread Stephan Raemy
Hi solr-cloud users,

I'm currently setting up a solr-cloud/zookeeper instance and so far,
everything works out fine. I downloaded the source from the cloud branch
yesterday and built it from source.

I've got 10 shards distributed across 4 servers and a zookeeper instance.
Searching documents with the flag "distrib=true" works out and it returns
the expected result.

But here comes the tricky question. I will add new documents every day and
therefore, I'd like to balance my shards to keep the system speedy. The
Wiki says that one can calculate the hash of a document id and then
determine the corresponding shard. But IMHO, this does not take into account
that the cloud may become bigger or shrink over time by adding or removing
shards. Obviously adding has a higher priority since one wants to reduce
the shard size to improve the response time of distributed searches.

When reading through the Wikis and existing documentation, it is still
unclear to me how to do the following operations:
- Modify/Delete a document stored in the cloud without having to store the
  document:shard mapping information outside of the cloud. I would expect
  something like a shard attribute on each doc in the SOLR query result
  (activated/deactivated by a flag), so that I can query the SOLR cloud for a
  doc and then delete it on the specific shard.
- Balance a cloud when adding/removing new shards or just balance them after
  many deletions.

Of course there are solutions to this, but in the end, I'd love to have a
true cloud where I do not have to worry about shard performance
optimization.
Hints are greatly appreciated.

Cheers,
Stephan


solr / lucene engineering positions in Boston, MA USA @ the Echo Nest

2010-09-10 Thread Brian Whitman
Hi all, brief message to let you know that we're in heavy hire mode at the
Echo Nest. As many of you know we are very heavy solr/lucene users (~1bn
documents across many many servers) and a lot of our staff have been working
with and contributing to the projects over the years. We are a "music
intelligence" company -- we crawl the web and do a lot of fancy math on
music audio and text to then provide things like recommendation, feeds,
remix capabilities, playlisting etc to a lot of music labels, social
networks and small developers via a very popular API.

We are especially interested in people with Lucene & Solr experience who
aren't afraid to get into the guts and push it to its limits. If any of
these positions fit you please let me know. We are hiring full time in the
Boston area (Davis Square, Somerville) for senior and junior engineers as
well as data architects.

http://the.echonest.com/company/jobs/

http://developer.echonest.com/docs/v4/

http://the.echonest.com/


Re: Solr CoreAdmin create ignores dataDir Parameter

2010-09-10 Thread Frank Wesemann

Mark Miller wrote:



I think so - what version of Solr are you using? I believe I've changed
this on trunk a few months ago.

  
We are running 1.4.2 and I looked in the solr/tags/release-1.4.1 branch 
of SVN.
The version in trunk that I can see is from 27.07.2010, and this also reads
the config first and then the CoreDescriptor.

I added a comment to SOLR-1905 regarding this issue.


- Mark
  



--
mit freundlichem Gruß,

Frank Wesemann
Fotofinder GmbH USt-IdNr. DE812854514
Software Entwicklung    Web: http://www.fotofinder.com/
Potsdamer Str. 96   Tel: +49 30 25 79 28 90
10785 Berlin    Fax: +49 30 25 79 28 999

Sitz: Berlin
Amtsgericht Berlin Charlottenburg (HRB 73099)
Geschäftsführer: Ali Paczensky





Re: Date faceting +1MONTH problem

2010-09-10 Thread Dennis Gearon
My plan has been to use unix timestamps as integer fields. I also was going to
use 'all balls' (all zeros) time for dates without a time. Midnight is actually AM,
so I was going to count it as the next day.

To get my range, I was going to use a greater-than and a less-than on
the two integers, calculating them outside of Solr/Lucene and putting them into
the query.

Anybody have any thoughts on how fast that would be compared to the range query 
for dates?
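
For illustration, a rough sketch of computing those boundaries and the
resulting range query (the integer field name "timestamp_i" is just an
example) might be:

import java.util.Calendar;
import java.util.TimeZone;

public class EpochRangeQuery {
    public static void main(String[] args) {
        // Compute the [start, end) unix-timestamp boundaries for a single day.
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(2010, Calendar.SEPTEMBER, 10);            // 2010-09-10T00:00:00Z
        long start = cal.getTimeInMillis() / 1000L;
        cal.add(Calendar.DAY_OF_MONTH, 1);
        long end = cal.getTimeInMillis() / 1000L;

        // Inclusive range query against a hypothetical integer field.
        String q = "timestamp_i:[" + start + " TO " + (end - 1) + "]";
        System.out.println(q);
    }
}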


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/9/10, Liam O'Boyle  wrote:

> From: Liam O'Boyle 
> Subject: Re: Date faceting +1MONTH problem
> To: solr-user@lucene.apache.org
> Date: Thursday, September 9, 2010, 4:20 PM
> Hi Chris,
> 
> Yes, I saw the facet.range.include feature and briefly
> tried to implement it
> before realising that it was Solr 3.1 only :)  I agree
> that it seems like
> the best solution to the problem.
> 
> Reindexing with a +1MILLI hack had occurred to me and I
> guess that's what
> I'll do in the meantime; it just seemed like something that
> people must have
> run into before!  I suppose it depends on the
> granularity of your
> timestamps; all of my values are actually just dates, so
> I've been putting
> them in as the date with T00:00:00.000Z, which makes the
> overlap problem
> very obvious.
> 
> If anyone else has come across a solution for this, feel
> free to suggest
> another approach, otherwise it's reindexing time.
> 
> Cheers,
> Liam
> 
> 
> On Fri, Sep 10, 2010 at 8:38 AM, Chris Hostetter
> wrote:
> 
> > : I'm trying to break down the data over a year into
> facets by month; to
> > avoid
> > : overlap, I'm using -1MILLI on the start and end
> dates and using a gap of
> > : +1MONTH.
> > :
> > : However, it seems like February completely breaks my
> monthly cycles,
> > leading
> >
> > Yep.
> >
> > Everything you posted makes sense to me in how
> DateMath works - "Jan 31 @
> > 23:59.999" + "1 MONTH" results in "Feb 28 @ 23:59.999"
> ... at which point
> > adding "1 MONTH" to that results in "Mar 28 @ ..."
> because there is no
> > context of what the initial starting point was.
> >
> > It's not a situation i've ever personally run into ...
> one workaround
> > would be to use a "+1MILLI" fudge factor at indexing
> time, instead of a
> > "-1MILLI" fudge factor at query time ... that
> shouldn't have this problem.
> >
> > If you'd like to open a bug to track this, I think it
> might be possible to
> > fix this behavior (there are some features in the Java
> calendaring code
> > that make things like "Jan 31 + 2 Months" do the right
> thing) but
> > personally I think working on SOLR-1896 (combined with
> the new
> > facet.range.include param) is a more effective use of
> time so
> > we can eliminate the need for this type of hack
> completely in future Solr
> > releases.
> >
> > -Hoss
> >
> > --
> > http://lucenerevolution.org/  ...  October
> 7-8, Boston
> > http://bit.ly/stump-hoss      ... 
> Stump The Chump!
> >
> >
>


Re: How to delete documents from a SOLR cloud / balance the shards in the cloud?

2010-09-10 Thread James Liu
Stephan and all,

I am evaluating this like you are. You may want to check
http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/.
I would appreciate if others can shed some light on this, too.
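
For illustration only, a bare-bones consistent-hashing ring along the lines of
that article (plain Java, not Solr code) might look like this sketch:

import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Document ids map onto a ring of shards; adding or removing a shard only
// remaps the keys falling on the arc it owns, not every document.
public class ShardRing {
    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();
    private final int replicas = 100;   // virtual nodes per shard, for smoother balance

    public void addShard(String shard) throws Exception {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(shard + "#" + i), shard);
        }
    }

    public void removeShard(String shard) throws Exception {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(shard + "#" + i));
        }
    }

    public String shardFor(String docId) throws Exception {
        SortedMap<Long, String> tail = ring.tailMap(hash(docId));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private long hash(String key) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        long h = 0;
        for (int i = 0; i < 8; i++) {
            h = (h << 8) | (md5[i] & 0xff);
        }
        return h;
    }
}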

Bests,
James
On Fri, Sep 10, 2010 at 6:07 AM, Stephan Raemy wrote:

> Hi solr-cloud users,
>
> I'm currently setting up a solr-cloud/zookeeper instance and so far,
> everything works out fine. I downloaded the source from the cloud branch
> yesterday and build it from source.
>
> I've got 10 shards distributed across 4 servers and a zookeeper instance.
> Searching documents with the flag "distrib=true" works out and it returns
> the expected result.
>
> But here comes the tricky question. I will add new documents every day and
> therefore, I'd like to balance my shards to keep the system speedy. The
> Wiki says that one can calculate the hash of a document id and then
> determine the corresponding shard. But IMHO, this does not take into
> account
> that the cloud may become bigger or shrink over time by adding or removing
> shards. Obviously adding has a higher priority since one wants to reduce
> the shard size to improve the response time of distributed searches.
>
> When reading through the Wikis and existing documentation, it is still
> unclear to me how to do the following operations:
> - Modify/Delete a document stored in the cloud without having to store the
>  document:shard mapping information outside of the cloud. I would expect
>  something like shard attribute on each doc in the SOLR query result
>  (activated/deactivated by a flag), so that i can query the SOLR cloud for
> a
>  doc and then delete it on the specific shard.
> - Balance a cloud when adding/removing new shards or just balance them
> after
>  many deletions.
>
> Of course there are solutions to this, but at the end, I'd love to have a
> true cloud where i do not have to worry about shard performance
> optimization.
> Hints are greatly appreciated.
>
> Cheers,
> Stephan
>


Re: solr / lucene engineering positions in Boston, MA USA @ the Echo Nest

2010-09-10 Thread Yonik Seeley
On Fri, Sep 10, 2010 at 9:18 AM, Brian Whitman  wrote:
> Hi all, brief message to let you know that we're in heavy hire mode at the
> Echo Nest. As many of you know we are very heavy solr/lucene users (~1bn
> documents across many many servers) and a lot of our staff have been working
> with and contributing to the projects over the years. We are a "music
> intelligence" company -- we crawl the web and do a lot of fancy math on
> music audio and text to then provide things like recommendation, feeds,
> remix capabilities, playlisting etc to a lot of music labels, social
> networks and small developers via a very popular API.

Cool stuff Brian!
Might make a good entry for http://wiki.apache.org/solr/PublicServers ?

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


RE: How to extend IndexSchema and SchemaField

2010-09-10 Thread Charlie Jackson
Have you already explored the idea of using a custom analyzer for your
field? Depending on your use case, that might work for you.

- Charlie


Autocomplete with Filter Query

2010-09-10 Thread David Yang
Hi,

 

Is there any way to provide autocomplete while filtering results?
Suppose I had a bunch of people and each person has multiple
occupations. When I select 'Assistant' in a filter box, it would be nice
if autocomplete only provides assistant names, instead of all names. The
other issue is that I use DisMax to do my search (name, title, phone
number etc) - so it might be more complex to do autocomplete. I could
have a copy field to copy all dismax terms into one big field.

 

Cheers,

 

David 



Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Savannah Beckett
Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in 
the solr index and use the data from these similar documents to modify a field 
in each document that I am indexing.  I found that MoreLikeThis in Solr only 
works when the document is already in the index - is that true?  If so, I may have to wait 
until the indexing is finished, then run my own command to do MoreLikeThis on each 
document in the index, and then reindex each document.  That doesn't sound 
efficient.  Is there a better way?
Thanks.





From: Liam O'Boyle 
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after 
the
> document was already indexed, and I have only the document id.  How do I do
> that?
>
> Thanks.
>
>
>



  

No more trunk support for 2.9 indexes

2010-09-10 Thread peter . sturge

Hi,

I'm sure there are good reasons for the decision to no longer support 2.9  
format indexes in 4.0, and not have an automatic upgrade as in previous  
versions.


Since Lucene 3.0.2 is 'out there', does this mean the format is nailed  
down, and some sort of porting is possible?
Does anyone know of a tool that can read the entire contents of a Solr  
index and (re)write it to another? (as an indexing operation - e.g. 2.9 ->  
3.0.x, so not replication)


Thanks,
Peter


Re: No more trunk support for 2.9 indexes

2010-09-10 Thread Dennis Gearon
So converting an index would be faster than remaking it?

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/10/10, peter.stu...@gmail.com  wrote:

> From: peter.stu...@gmail.com 
> Subject: No more trunk support for 2.9 indexes
> To: solr-user@lucene.apache.org
> Date: Friday, September 10, 2010, 8:20 AM
> Hi,
> 
> I'm sure there are good reasons for the decision to no
> longer support 2.9 format indexes in 4.0, and not have an
> automatic upgrade as in previous versions.
> 
> Since Lucene 3.0.2 is 'out there', does this mean the
> format is nailed down, and some sort of porting is
> possible?
> Does anyone know of a tool that can read the entire
> contents of a Solr index and (re)write it another? (as an
> indexing operation - eg 2.9 -> 3.0.x, so not repl)
> 
> Thanks,
> Peter
> 


RE: Autocomplete with Filter Query

2010-09-10 Thread Jonathan Rochkind
I've been thinking about this too, and haven't come up with any GREAT way. But 
there are several possible ways, that will do different things, good or bad, 
depending on the nature of your data and exactly what you want to do.  So here 
are some ideas I've been thinking about, but not a ready made solution for you. 

One thing first, the statement about "copy field to copy all dismax terms into 
one big field." doesn't exactly make sense. Copyfield is something that happens 
at index time, whereas dismax is only something that is used at query time.  
Since it's only used at query time, just because you are using dismax for your 
main search, doesn't mean you have to use dismax for your autocomplete query.   
The autocomplete query, that returns the things you're going to display in your 
auto-complete list, can be set up however you want.  (we are talking about an 
auto-complete list, not a "Google Instant" style autocomplete, right?  The 
latter would introduce even more issues). 

So, do you want the autocomplete to only match on the _entire query_ as 
entered, or do you want an autocomplete for each word?  For instance, if I 
enter "dog walking", should the autocomplete be autocompleting "dog walking" as 
a whole, or should it be autocompleting "walking" by the time I've typed in 
"dog walking"?  It's easier to set up to autocomplete on the whole phrase. 

Next, though, you probably want autocomplete to complete on partial words, not 
just complete words. "Dog wal" should autocomplete to "dog walking". That 
introduces an extra kink too. But let's assume we want that. 

So one idea. At index time, populate a field that will be used exclusively for 
auto-completing. Make this field actually _non-tokenizing_, probably a Text 
type but with the KeywordTokenizer (ie, the non-tokenizing tokenizer, heh).   
So if you're indexing "dog walking", then the token in the field is actually 
"dog walking", not ["dog","walking"].   Next, normalize it by removing 
punctuation (because we probably don't want to consider punctuation for 
auto-completing), and maybe normalizing whitespace by collapsing any adjacent 
whitespace to a single space, and removing whitespace at beginning and end. So 
"   dog walking   " will index as "dog walking". (This matters more at 
query time than index time, but it's less confusing to do the same normalization at 
both points).  That can be done with a charpatternfilter.  

But now we've also got to n-gram expand it.  So if the term being indexed is 
"dog walking", we actually want to store ALL these terms in the index:
"d"
"do"
"dog"
"dog "
"dog w"
"dog wa"
etc

Ie, n-grams, but only expanded out from the front.  I believe you can use the 
EdgeNGramFilterFactory for this (at index time only, this one you don't want in 
your query-time analyzers).  Although I haven't actually tried the 
EdgeNGramFilterFactory with a non-tokenized field, I think it should work. This 
will expand the size of your index, hopefully not to a problematic degree. 

Now, to actually do the auto-complete. At query time, take the whole thing the 
user has entered, and issue a query, with whatever fq's you want too, but use 
the "field" type query parser (NOT "dismax" or "lucene", because we don't want 
the query parser to pre-tokenize on whitespace, but not "raw" because we DO 
want to go through the query-time field analyzers), restricted to this 
autocomplete field you've created. One way to do this is:  << q={!field 
f=my_autocomplete_field}the user's query >> (url-encoded, naturally). 

That's pretty much it, I think that should work, depending on the requirements 
of 'work'.  Although I haven't tried it yet. 

Now, if you want the user's query to auto-complete match in the middle of your 
terms, things get a lot more complicated. Ie, if you want "walk" to 
auto-complete to "dog walking" too.  This won't do that.  Also, if you want 
some kind of stemming to happen in auto-complete, this won't do that either. 
And also, if you want to auto-complete not the entire phrase the user has typed 
in, but each white-space-seperated word as they type it, this won't do THAT 
either.  Trying to get all those things to work becomes even more complicated 
-- especially with the requirement that you want to be able to apply the 'fq's 
from your current search context to the auto-complete.  I haven't entirely 
thought through a possible way to do all that. 

But hopefully this gives you some clues to think about it. 
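
For illustration, the query-time half of that expressed with SolrJ (the field
names "autocomplete", "occupation" and "name" are made up) might look like:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class AutocompleteQuery {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        String typed = "dog wal";   // whatever the user has typed so far

        // {!field} routes the raw input through the autocomplete field's analyzers
        // without whitespace pre-tokenization by the query parser.
        SolrQuery q = new SolrQuery("{!field f=autocomplete}" + typed);
        q.addFilterQuery("occupation:Assistant");   // the filter from the current search context
        q.setRows(10);
        q.setFields("name");

        for (SolrDocument d : server.query(q).getResults()) {
            System.out.println(d.getFieldValue("name"));
        }
    }
}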

Jonathan

From: David Yang [dy...@nextjump.com]
Sent: Friday, September 10, 2010 11:14 AM
To: solr-user@lucene.apache.org
Subject: Autocomplete with Filter Query

Hi,



Is there any way to provide autocomplete while filtering results?
Suppose I had a bunch of people and each person has multiple
occupations. When I select 'Assistant' in a filter box, it would be nice
if autocomplete only provides assistant names, instead of all names. The
other issue is that I use DisMax 

Re: No more trunk support for 2.9 indexes

2010-09-10 Thread Peter Sturge
If a tool exists for converting 2.9->3.0.x, it would likely be faster.
Do you know if such a tool exists?
Remaking the index, in my case, can only be done from the existing
index because the original data is no longer available (it is
transient network data).
I suppose an index 'remaker' might be something like a DIH reader for
a Solr index - streams everything out of the existing index, writing
it into the new one?
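
For illustration, a rough SolrJ sketch of such a "remaker" (it assumes every
field you need is stored, and ignores copyField duplication and the cost of
deep start/rows paging on a large index) might be:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class IndexCopier {
    public static void main(String[] args) throws Exception {
        SolrServer source = new CommonsHttpSolrServer("http://localhost:8983/solr/old");
        SolrServer target = new CommonsHttpSolrServer("http://localhost:8984/solr/new");

        int rows = 500;
        for (int start = 0; ; start += rows) {
            SolrQuery q = new SolrQuery("*:*");
            q.setStart(start);
            q.setRows(rows);
            SolrDocumentList docs = source.query(q).getResults();
            if (docs.isEmpty()) break;
            for (SolrDocument d : docs) {
                SolrInputDocument in = new SolrInputDocument();
                for (String f : d.getFieldNames()) {
                    in.addField(f, d.getFieldValue(f));   // copies stored values only
                }
                target.add(in);
            }
        }
        target.commit();
    }
}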


Peter



On Fri, Sep 10, 2010 at 4:39 PM, Dennis Gearon  wrote:
> So converting an index would be faster than remaking it?
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/10/10, peter.stu...@gmail.com  wrote:
>
>> From: peter.stu...@gmail.com 
>> Subject: No more trunk support for 2.9 indexes
>> To: solr-user@lucene.apache.org
>> Date: Friday, September 10, 2010, 8:20 AM
>> Hi,
>>
>> I'm sure there are good reasons for the decision to no
>> longer support 2.9 format indexes in 4.0, and not have an
>> automatic upgrade as in previous versions.
>>
>> Since Lucene 3.0.2 is 'out there', does this mean the
>> format is nailed down, and some sort of porting is
>> possible?
>> Does anyone know of a tool that can read the entire
>> contents of a Solr index and (re)write it another? (as an
>> indexing operation - eg 2.9 -> 3.0.x, so not repl)
>>
>> Thanks,
>> Peter
>>
>


Re: Building query based on value of boolean field

2010-09-10 Thread PeterKerk

Oh and the field in the result looks like:
<bool name="partylocation">false</bool>

but when I do this: q=partylocation:false

I still get no results! :s
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Building-query-based-on-value-of-boolean-field-tp1449018p1453266.html
Sent from the Solr - User mailing list archive at Nabble.com.


what would cause large numbers of executeWithRetry INFO messages?

2010-09-10 Thread solr-user

I see a large number (~1000) of the following executeWithRetry messages in my
apache catalina log files every day (see the snippet below).  They seem
to appear at random intervals.

Since they are not flagged as errors or warnings, I have been ignoring them
for now.  However, I started wondering if the "INFO" level is a red herring
and thinking that there might be an actual problem somewhere.

Does anyone know what would cause this type of message?  Are they normal?  I
have not seen anything in my google searches for solr that contain this
message

Details:

1. My CPU usage seems fine as does my heap; we have lots of cpu capacity and
heap space
2. The log is from a searcher but I know that the intervals do not
correspond to replication (every 15 min on the hour)
3. the INFO lines appear in all searcher logs (we have a number of
searchers)
4. the data is around 10m records per searcher and occupies around 14gb
5. I am not noticing any problems performing queries on the solr (so no
trace info to give you); performance and queries seem fine

Log snippet:
Sep 10, 2010 2:17:59 AM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
Sep 10, 2010 2:18:20 AM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException)
caught when processing request: The server xxx.admin.inf failed to respond
Sep 10, 2010 2:18:20 AM org.apache.commons.httpclient.HttpMethodDirector
executeWithRetry
INFO: Retrying request
Sep 10, 2010 2:18:20 AM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.

any info appreciated.  thx
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/what-would-cause-large-numbers-of-executeWithRetry-INFO-messages-tp1453417p1453417.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to extend IndexSchema and SchemaField

2010-09-10 Thread Renaud Delbru

 Hi Javier,

On 10/09/10 07:15, Javier Diaz wrote:

Looking at the code we found out that there's no way to extend the schema.
Finally we copied part of the code that reads the schema in our
RequestHandler. It works but I'm not sure if it's the best way to do it. Let
me know if you want our code as an example.
So, do you mean you are duplicating part of the code for reading the 
schema and parsing the schema in your own way in your request handler?
If you could share the code so we can have a look, it could be helpful and 
inspiring. Cheers.

--
Renaud Delbru


Re: How to extend IndexSchema and SchemaField

2010-09-10 Thread Renaud Delbru

 Hi Charlie,

On 10/09/10 16:11, Charlie Jackson wrote:

Have you already explored the idea of using a custom analyzer for your
field? Depending on your use case, that might work for you.
Yes, I have thought of that, or even extending the field type. But this does 
not work for my use case, since I can have multiple fields of the same 
type (therefore with the same field type and same analyzer), but each 
one of them needs specific information. Therefore, I think the only 
"nice" way to achieve this is to have the possibility to add attributes 
to any field definition.


cheers
--
Renaud Delbru


Re: Delta Import with something other than Date

2010-09-10 Thread Alexey Serba
> Can you provide a sample of passing the parameter via URL? And how using it 
> would look in the data-config.xml
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters


RE: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Jonathan Rochkind
"More like this" is intended to be run at query time. For what reasons are you 
thinking you want to (re-)index each document based on the results of 
MoreLikeThis?  You're right that that's not what the component is intended for. 

Jonathan

From: Savannah Beckett [savannah_becket...@yahoo.com]
Sent: Friday, September 10, 2010 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: How to Update Value of One Field of a Document in Index?

Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in
the solr index and use the data from these similar documents to modify a field
in each document that I am indexing.  I found that MoreLikeThis in Solr only
works when the document is in the index, is it true?  If so, I may have to wait
til the indexing is finished, then run my own command to do MoreLikeThis to each
document in the index, and then reindex each document?  It sounds like it's not
efficient.  Is there a better way?
Thanks.





From: Liam O'Boyle 
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after
the
> document was already indexed, and I have only the document id.  How do I do
> that?
>
> Thanks.
>
>
>





RE: Autocomplete with Filter Query

2010-09-10 Thread David Yang
Cool idea! 
I was suggesting a copy field because I want to provide autocomplete on
any field that the dismax can search on - eg if dismax searches both
name and phone, then when they start typing name or phone, I want it to
give autocompletion there 

So to get your idea clear are you suggesting a field like this:





And searching like this: 
solr/core/select?q=Autocomplete:(dog wal)&fq=UserSelectedFilter

On a related note: how do you deal with no exact ngram match, but some
relevant ngrams? E.g. user types "dog wam" and it finds no ngrams with
"dog wam" but there are ngrams for "dog wal" (for dog walking) - this is
probably not too relevant though since mostly prefix suggestion should
be enough.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Friday, September 10, 2010 11:41 AM
To: solr-user@lucene.apache.org
Subject: RE: Autocomplete with Filter Query

I've been thinking about this too, and haven't come up with any GREAT
way. But there are several possible ways, that will do different things,
good or bad, depending on the nature of your data and exactly what you
want to do.  So here are some ideas I've been thinking about, but not a
ready made solution for you. 

One thing first, the statement about "copy field to copy all dismax
terms into one big field." doesn't exactly make sense. Copyfield is
something that happens at index time, whereas dismax is only something
that is used at query time.  Since it's only used at query time, just
because you are using dismax for your main search, doesn't mean you have
to use dismax for your autocomplete query.   The autocomplete query,
that returns the things you're going to display in your auto-complete
list, can be set up however you want.  (we are talking about an
auto-complete list, not a "Google Instant" style autocomplete, right?
The latter would introduce even more issues). 

So, do you want the autocomplete to only match on the _entire query_ as
entered, or do you want an autocomplete for each word?  For instance, if
I enter "dog walking", should the autocomplete be autocompleting "dog
walking" as a whole, or should it be autocompleting "walking" by the
time I've typed in "dog walking"?  It's easier to set up to autocomplete
on the whole phrase. 

Next, though, you probably want autocomplete to complete on partial
words, not just complete words. "Dog wal" should autocomplete to "dog
walking". That introduces an extra kink too. But let's assume we want
that. 

So one idea. At index time, populate a field that will be used
exclusively for auto-completing. Make this field actually
_non-tokenizing_, probably a Text type but with the KeywordTokenizer
(ie, the non-tokenizing tokenizer, heh).   So if you're indexing "dog
walking", then the token in the field is actually "dog walking", not
["dog","walking"].   Next, normalize it by removing punctuation (because
we probably don't want to consider punctuation for auto-completing), and
maybe normalizing whitespace by collapsing any adjacent whitespace to a
single space, and removing whitespace at beginning and end. So "   dog
walking   " will index as "dog walking". (This matters more at query
time then index time, but less confusing to do the same normalization at
both points).  That can be done with a charpatternfilter.  

But now we've also got to n-gram expand it.  So if the term being
indexed is "dog walking", we actually want to store ALL these terms in
the index:
"d"
"do"
"dog"
"dog "
"dog w"
"dog wa"
etc

Ie, n-grams, but only expanded out from the front.  I believe you can
use the EdgeNGramFilterFactory for this (at index time only, this one
you don't want in your query-time analyzers).  Although I haven't
actually tried the EdgeNGramFilterFactory with a non-tokenized field, I
think it should work. This will expand the size of your index, hopefully
not to a problematic degree. 

Now, to actually do the auto-complete. At query time, take the whole
thing the user has entered, and issue a query, with whatever fq's you
want too, but use the "field" type query parser (NOT "dismax" or
"lucene", because we don't want the query parser to pre-tokenize on
whitespace, but not "raw" because we DO want to go through the
query-time field analyzers), restricted to this autocomplete field
you've created. One way to do this is:  << q={!field
f=my_autocomplete_field}the user's query >> (url-encoded, naturally). 

That's pretty much it, I think that should work, depending on the
requirements of 'work'.  Although I haven't tried it yet. 

Now, if you want the user's query to auto-complete match in the middle
of your terms, things get a lot more complicated. Ie, if you want "walk"
to auto-complete to "dog walking" too.  This won't do that.  Also, if
you want some kind of stemming to happen in auto-complete, this won't do
that either. And also, if you want to auto-complete not the entire
phrase the user has typed in, but each white-space-seperated word as
they type it, this won't do THAT either.

Sorting not working on a string field

2010-09-10 Thread noel
Hello, I seem to be having a problem with sorting. I have a string field 
(time_code) that I want to order by. When the results come up, they are in a 
different order than the relevance ranking, as I would expect, but they aren't 
actually sorted. The data in time_code came from a numeric decimal with six-digit 
precision, if that makes a difference (ex: 1.00).

Here is the query I give it:

q=ceremony+AND+presentation_id%3A296+AND+type%3Ablob&version=1.3&json.nl=map&rows=10&start=0&wt=json&hl=true&hl.fl=text&hl.simple.pre=&hl.simple.post=<%2Fspan>&hl.fragsize=0&hl.mergeContiguous=false&&sort=time_code+asc


And here's the field schema:













Thanks for any help.



Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Savannah Beckett
I want to do MoreLikeThis to find documents that are similar to the document 
that I am indexing.  Then I want to calculate the average of one of the fields 
of all those documents and input this average into a field of the document that 
I am indexing.  From my research, it seems that MoreLikeThis can only be used 
to find documents similar to a document that is already in the index.  So, I think 
I need to index it first, and then use MoreLikeThis to find similar documents in 
the index, and then reindex that document.  Is there a better way?  I'd rather not 
reindex a document because it's not efficient.  I don't have to use MoreLikeThis.
Thanks.




From: Jonathan Rochkind 
To: "solr-user@lucene.apache.org" 
Sent: Fri, September 10, 2010 9:58:20 AM
Subject: RE: How to Update Value of One Field of a Document in Index?

"More like this" is intended to be run at query time. For what reasons are you 
thinking you want to (re-)index each document based on the results of 
MoreLikeThis?  You're right that that's not what the component is intended for. 


Jonathan

From: Savannah Beckett [savannah_becket...@yahoo.com]
Sent: Friday, September 10, 2010 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: How to Update Value of One Field of a Document in Index?

Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in
the solr index and use the data from these similar documents to modify a field
in each document that I am indexing.  I found that MoreLikeThis in Solr only
works when the document is in the index, is it true?  If so, I may have to wait
til the indexing is finished, then run my own command to do MoreLikeThis to each
document in the index, and then reindex each document?  It sounds like it's not
efficient.  Is there a better way?
Thanks.





From: Liam O'Boyle 
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after
the
> document was already indexed, and I have only the document id.  How do I do
> that?
>
> Thanks.
>
>
>


  

SEVERE: java.io.IOException: The specified network name is no longer available

2010-09-10 Thread brian519

Hi all,

Using Solr 1.4 hosted with Tomcat 6 on Windows 2003 (sometimes Windows 2008)

Occasionally, we can't search anymore and this error shows up in the log
file:

SEVERE: java.io.IOException: The specified network name is no longer
available
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(Unknown Source)
at
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.readInternal(SimpleFSDirectory.java:132)
at
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
at
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:80)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:64)
at 
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:129)
at 
org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:160)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
at 
org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:299)
at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:939)
at
org.apache.lucene.index.DirectoryReader$MultiTermEnum.(DirectoryReader.java:973)
at 
org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:620)
at 
org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
at
org.apache.solr.handler.component.TermsComponent.process(TermsComponent.java:82)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
at java.lang.Thread.run(Unknown Source)


Once we see the error, it is persistent.  Restarting Tomcat makes the error
stop.  This is happening across a variety of deployments and networks, so I
don't think there is an actual network problem.  Many other apps operate
fine on the same server(s)/network(s). 

We are using a file server (again with many different configurations) to
store the actual index files, which are not on the same machine as
Solr/Tomcat.   On a few deployments we're moving the index local to Solr to
see if that corrects the problem.

My questions are:

1. Has anyone seen something like this before?
2. Is putting the index files on a file server a supported configuration?

Thanks for any help
Brian.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SEVERE-java-io-IOException-The-specified-network-name-is-no-longer-available-tp1454016p1454016.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SEVERE: java.io.IOException: The specified network name is no longer available

2010-09-10 Thread Yonik Seeley
On Fri, Sep 10, 2010 at 2:12 PM, brian519  wrote:
> Once we see the error, it is persistent.  Restarting Tomcat makes the error
> stop.  This is happening across a variety of deployments and networks, so I
> don't think there is an actual network problem.  Many other apps operate
> fine on the same server(s)/network(s).

Hmmm, that's interesting.  I wonder if it's a Java bug or something?
There's nothing in lucene/solr that I know of that would lead to "The
specified network name is no longer
available".

What JVM are you using?

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: SEVERE: java.io.IOException: The specified network name is no longer available

2010-09-10 Thread brian519


Yonik Seeley-2-2 wrote:
> 
> 
> Hmmm, that's interesting.  I wonder if it's a Java bug or something?
> There's nothing in lucene/solr that I know of that would lead to "The
> specified network name is no longer
> available".
> 
> What JVM are you using?
> 
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
> 
> 

java -version gives me this:

java version "1.6.0_18"
Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)

Is that what you're looking for?  

Can you confirm that putting the Index files on a network share is OK?   We
used to use Lucene and I know that this was bad news .. but since getting
these errors we're wondering whether Solr might have a similar issue.  

I found this Java bug, but it's really old and I'm not sure if it's
relevant:  http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6176034

Thanks Yonik
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SEVERE-java-io-IOException-The-specified-network-name-is-no-longer-available-tp1454016p1454271.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SEVERE: java.io.IOException: The specified network name is no longer available

2010-09-10 Thread Erick Erickson
SOLR uses Lucene, so I'd expect that putting your index files on a share
isn't any more robust under SOLR than under Lucene.

Sounds to me like your network's glitchy.

FWIW
Erick

On Fri, Sep 10, 2010 at 3:00 PM, brian519  wrote:

>
>
> Yonik Seeley-2-2 wrote:
> >
> >
> > Hmmm, that's interesting.  I wonder if it's a Java bug or something?
> > There's nothing in lucene/solr that I know of that would lead to "The
> > specified network name is no longer
> > available".
> >
> > What JVM are you using?
> >
> > -Yonik
> > http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
> >
> >
>
> java -version gives me this:
>
> java version "1.6.0_18"
> Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
> Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
>
> Is that what you're looking for?
>
> Can you confirm that putting the Index files on a network share is OK?   We
> used to use Lucene and I know that this was bad news .. but since getting
> these errors we're wondering whether Solr might have a similar issue.
>
> I found this Java bug, but it's really old and I'm not sure if it's
> relevant:  http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6176034
>
> Thanks Yonik
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SEVERE-java-io-IOException-The-specified-network-name-is-no-longer-available-tp1454016p1454271.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SEVERE: java.io.IOException: The specified network name is no longer available

2010-09-10 Thread brian519


Erick Erickson wrote:
> 
> SOLR uses Lucene, so I'd expect that putting your index files on a share
> isn't any more robust under SOLR than Lucene
> 
> Sounds to me like your network's glitchy.
> 
> 

Except that with Lucene we had separate processes searching and indexing
directly against the files over the network, so it wouldn't be possible to
have any centralized locking logic beyond NTFS.  

Solr is different in that indexing and searching requests go through Solr
and are passed to Lucene through the same process.  This is why it sounds
plausible to me that Solr could handle storing the index files on a network
share.

Originally I blamed the network too.  But given that other apps that are
using the same file server are working fine, and that we have this problem
across completely separate networks (ie, different clients of ours that have
completely different infrastructures), I don't think that it is the network.

Does anyone else store their Solr index on a file share, or NAS or SAN or
anything other than a local disk?

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SEVERE-java-io-IOException-The-specified-network-name-is-no-longer-available-tp1454016p1454468.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr CoreAdmin create ignores dataDir Parameter

2010-09-10 Thread MitchK

Frank,

have a look at SOLR-646.

Do you think a workaround for the data-dir-tag in the solrconfig.xml can
help?
I'm thinking about something like ${solr./data/corename} for
illustration.

Unfortunately I am not very skilled in working with solr's variables and
therefore I do not know what variables are available. 

If we find a solution, we should provide it as a suggestion at the wiki's
CoreAdmin-page.

Kind regards,
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-CoreAdmin-create-ignores-dataDir-Parameter-tp1451665p1454705.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Autocomplete with Filter Query

2010-09-10 Thread Peter Karich
Hi there,

I don't know if my idea is perfect but it seems to work ok in my
twitter-search prototype:
http://www.jetwick.com
(keep in mind it is a vhost and only one fat index, no sharding, etc...
so performance isn't perfect ;-))

That said, type in 'so' and you will get 'soldier', 'solar', ... but
this is context sensitive:
e.g. search 'timetabling' (shameless self propaganda). Then again type
'so' in the query box and you'll get (a) different suggestion(s): 'solr'.
The same context dependency stuff (should ;-)) work for the filters on
the left side.

How it works?

I am indexing only relevant terms from the tweet into a special 'tag' field
(removing a lot of noise words, whitespace and strange chars).
then I am doing a faceted search with a tag.facet.prefix= (or so)
and I get context sensitive suggestions, because I can use fq as well.
Now if the query contains at least two terms I am splitting the query:
the last term goes to the facet prefix parameter and the first term(s)
go to q

e.g. 'michael ja'=> q=michael&tag.facet.prefix=ja and I will get back
'jackie','jackson'.
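
For illustration, the same facet.prefix approach expressed with SolrJ (the
"tag" field as above; the "lang:en" filter is just an example of a context
filter) might look like:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetPrefixSuggest {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // User typed "michael ja": the last term becomes the facet prefix,
        // the earlier term(s) become the query.
        SolrQuery q = new SolrQuery("michael");
        q.setRows(0);                        // only the facet counts are needed
        q.setFacet(true);
        q.addFacetField("tag");
        q.setFacetPrefix("tag", "ja");
        q.setFacetLimit(10);
        q.setFacetMinCount(1);
        q.addFilterQuery("lang:en");         // keeps the suggestions context sensitive

        QueryResponse rsp = server.query(q);
        FacetField tags = rsp.getFacetField("tag");
        if (tags != null && tags.getValues() != null) {
            for (FacetField.Count c : tags.getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")");
            }
        }
    }
}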

Regards,
Peter.


PS: jetwick even has a Google Instant-like feature: when you are
selecting the suggestions it will update the results ...
google instant is too disruptive in my opinion (results moving up and
down because of different number of suggestions),
but I am working on a less disruptive solution which doesn't hide the
first results

> Cool idea! 
> I was suggesting a copy field because I want to provide autocomplete on
> any field that the dismax can search on - eg if dismax searches both
> name and phone, then when they start typing name or phone, I want it to
> give autocompletion there 
>
> So to get your idea clear are you suggesting a field like this:
>
> 
> 
> 
>
> And searching like this: 
> solr/core/select?q=Autocomplete:(dog wal)&fq=UserSelectedFilter
>
> On a related note: how do you deal with no exact ngram match, but some
> relevant ngrams? E.g. user types "dog wam" and it finds no ngrams with
> "dog wam" but there are ngrams for "dog wal" (for dog walking) - this is
> probably not too relevant though since mostly prefix suggestion should
> be enough.
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
> Sent: Friday, September 10, 2010 11:41 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Autocomplete with Filter Query
>
> I've been thinking about this too, and haven't come up with any GREAT
> way. But there are several possible ways, that will do different things,
> good or bad, depending on the nature of your data and exactly what you
> want to do.  So here are some ideas I've been thinking about, but not a
> ready made solution for you. 
>
> One thing first, the statement about "copy field to copy all dismax
> terms into one big field." doesn't exactly make sense. Copyfield is
> something that happens at index time, whereas dismax is only something
> that is used at query time.  Since it's only used at query time, just
> because you are using dismax for your main search, doesn't mean you have
> to use dismax for your autocomplete query.   The autocomplete query,
> that returns the things you're going to display in your auto-complete
> list, can be set up however you want.  (we are talking about an
> auto-complete list, not a "Google Instant" style autocomplete, right?
> The latter would introduce even more issues). 
>
> So, do you want the autocomplete to only match on the _entire query_ as
> entered, or do you want an autocomplete for each word?  For instance, if
> I enter "dog walking", should the autocomplete be autocompleting "dog
> walking" as a whole, or should it be autocompleting "walking" by the
> time I've typed in "dog walking"?  It's easier to set up to autocomplete
> on the whole phrase. 
>
> Next, though, you probably want autocomplete to complete on partial
> words, not just complete words. "Dog wal" should autocomplete to "dog
> walking". That introduces an extra kink too. But let's assume we want
> that. 
>
> So one idea. At index time, populate a field that will be used
> exclusively for auto-completing. Make this field actually
> _non-tokenizing_, probably a Text type but with the KeywordTokenizer
> (ie, the non-tokenizing tokenizer, heh).   So if you're indexing "dog
> walking", then the token in the field is actually "dog walking", not
> ["dog","walking"].   Next, normalize it by removing punctuation (because
> we probably don't want to consider punctuation for auto-completing), and
> maybe normalizing whitespace by collapsing any adjacent whitespace to a
> single space, and removing whitespace at beginning and end. So "   dog
> walking   " will index as "dog walking". (This matters more at query
> time than at index time, but it's less confusing to do the same normalization at
> both points).  That can be done with a char filter (e.g. PatternReplaceCharFilterFactory).
>
> But now we've also got to n-gram expand it.  So if the term being
> indexed is "dog walking", we actually want to stor

RE: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Markus Jelsma
The MoreLikeThis component actually can accept external input:

http://wiki.apache.org/solr/MoreLikeThisHandler#Using_ContentStreams
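
For example, assuming a /mlt handler is registered in solrconfig.xml and
"content" is the field to mine for interesting terms (names and parameters
below are only illustrative):

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>

http://localhost:8983/solr/mlt?stream.body=text+of+the+not+yet+indexed+document&mlt.fl=content&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=list&rows=5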
 
-Original message-
From: Jonathan Rochkind 
Sent: Fri 10-09-2010 18:59
To: solr-user@lucene.apache.org; 
Subject: RE: How to Update Value of One Field of a Document in Index?

"More like this" is intended to be run at query time. For what reasons are you 
thinking you want to (re-)index each document based on the results of 
MoreLikeThis?  You're right that that's not what the component is intended for. 

Jonathan

From: Savannah Beckett [savannah_becket...@yahoo.com]
Sent: Friday, September 10, 2010 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: How to Update Value of One Field of a Document in Index?

Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in
the Solr index and use the data from these similar documents to modify a field
in each document that I am indexing.  I found that MoreLikeThis in Solr only
works when the document is already in the index - is that true?  If so, I may have
to wait until indexing is finished, then run MoreLikeThis against each document
in the index, and then reindex each document.  That doesn't sound very
efficient.  Is there a better way?
Thanks.





From: Liam O'Boyle 
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after
the
> document was already indexed, and I have only the document id.  How do I do
> that?
>
> Thanks.
>
>
>





Re: Inconsistent search results with multiple keywords

2010-09-10 Thread Ron Mayer
Stéphane Corlosquet wrote:
> Hi all,
> 
> I'm new to solr so please let me know if there is a more appropriate place
> for my question below.
> 
> I'm noticing a rather unexpected number of results when I add more keywords
> to a search. I'm listing below a example (where I replaced the real keywords
> with placeholders):
> 
> keyword1 851 hits
> keyword1 keyword2  90 hits
> keyword1 keyword2 keyword3 269 hits
> keyword1 keyword2 keyword3 keyword4 47 hits
> 
> As you can see, adding k2 narrows down the amount of results (as I would
> expect), but adding k3 to k1 and k2 suddenly increases the amount of
> results. with 4 keywords, the results have been narrowed down again.

My guess - you might have it configured so that at least 60% of keywords have to 
hit.

For 1 or 2 keywords, that means they all need to hit.
For 3 keywords, that means 2 of the 3 need to match.
For 4 keywords, that means 3 of the 4 need to hit.

Or you might have a more complicated expression with effectively the same 
results.

It might be this "mm" parameter you're looking for:
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
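
If so, it is typically set in the dismax handler defaults in solrconfig.xml,
something like this (the exact spec is only an example; plain percentages such
as 60% are also allowed):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- all terms must match for 1-2 term queries; for longer queries one term may be missing -->
    <str name="mm">2&lt;-1</str>
  </lst>
</requestHandler>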


Solr memory use, jmap and TermInfos/tii

2010-09-10 Thread Burton-West, Tom
Hi all,

When we run the first query after starting up Solr, memory use goes up from 
about 1GB to 15GB and never goes below that level.  In debugging a recent OOM 
problem I ran jmap with the output appended below.  Not surprisingly, given the 
size of our indexes, it looks like the TermInfo and Term data structures which 
are the in-memory representation of the tii file are taking up most of the 
memory. This is running Solr under Tomcat with 16GB allocated to the jvm and 3 
shards each with a tii file of about 600MB.

Total index size is about 400GB for each shard (we are indexing about 600,000 
full-text books in each shard).

In interpreting the jmap output, can we assume that the listings for character 
arrays ("[C"), java.lang.String, long arrays ("[J"), and int 
arrays ("[I") are all part of the data structures involved in representing the 
tii file in memory?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

(jmap output, commas in numbers added)

num #instances #bytes  class name
--
   1:  82,496,803 4,273,137,904  [C
   2:  82,498,673 3,299,946,920  java.lang.String
   3:  27,810,887 1,112,435,480  org.apache.lucene.index.TermInfo
   4:  27,533,080 1,101,323,200  org.apache.lucene.index.TermInfo
   5:  27,115,577 1,084,623,080  org.apache.lucene.index.TermInfo
   6:  27,810,894  889,948,608  org.apache.lucene.index.Term
   7:  27,533,088  881,058,816  org.apache.lucene.index.Term
   8:  27,115,589  867,698,848  org.apache.lucene.index.Term
   9:   148  659,685,520  [J
  10: 2  222,487,072  [Lorg.apache.lucene.index.Term;
  11: 2  222,487,072  [Lorg.apache.lucene.index.TermInfo;
  12: 2  220,264,600  [Lorg.apache.lucene.index.Term;
  13: 2  220,264,600  [Lorg.apache.lucene.index.TermInfo;
  14: 2  216,924,560  [Lorg.apache.lucene.index.Term;
  15: 2  216,924,560  [Lorg.apache.lucene.index.TermInfo;
  16:737,060  155,114,960  [I
  17:627,793   35,156,408  java.lang.ref.SoftReference






Solr and jvm Garbage Collection tuning

2010-09-10 Thread Burton-West, Tom
We have noticed that when the first query hits Solr after starting it up, 
memory use increases significantly, from about 1GB to about 16GB, and then as 
queries are received it goes up to about 19GB at which point there is a Full 
Garbage Collection which takes about 30 seconds and then memory use drops back 
down to 16GB.  Under a relatively heavy load, the full GC happens about every 
10-20 minutes.

 We are running 3 Solr shards under one Tomcat with 20GB allocated to the jvm.  
Each shard has a total index size of about 400GB and a tii size of about 
600MB and indexes about 650,000 full-text books. (The server has a total of 
72GB of memory, so we are leaving quite a bit of memory for the OS disk cache).

Is there some argument we could give the jvm so that it would collect garbage 
more frequently? Or some other JVM tuning action that might reduce the amount 
of time where Solr is waiting on GC?

If we could get the time for each GC to take under a second, with the trade-off 
being that GC  would occur much more frequently, that would help us avoid the 
occasional query taking more than 30 seconds at the cost of a larger number of 
queries taking at least a second.


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search




Re: Null Pointer Exception with shards&facets where some shards have no values for some facets.

2010-09-10 Thread Ron Mayer
Ron Mayer wrote:
> Yonik Seeley wrote:
>> I just checked in the last part of those changes that should eliminate
>> any restriction on key.
>> But, that last part dealt with escaping keys that contained whitespace or }
>> Your example really should have worked after my previous 2 commits.
>> Perhaps not all of the servers got successfully upgraded?
> 
> Yes, quite possible.
> 
>> Can you try trunk again now?
> 
> Will check sometime tomorrow.


Yes, looks good now.
Thanks!



Re: Null Pointer Exception with shards&facets where some shards have no values for some facets.

2010-09-10 Thread Yonik Seeley
On Fri, Sep 10, 2010 at 7:21 PM, Ron Mayer  wrote:
> Ron Mayer wrote:
> Yes, looks good now.
> Thanks!

Great, thanks for the report!

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


multivalued fields in result

2010-09-10 Thread Jason Chaffee
Is it possible to return multivalued fields in the result?  

I would like to have a multivalued field that is stored and not indexed (I also 
copy the same field into another field where it is tokenized and indexed).  I 
would then like all the values of this field returned in the result set.  Is 
there a way to do this?
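
To make it concrete, what I have in mind is roughly this (field names made up):

<field name="labels" type="string" indexed="false" stored="true" multiValued="true"/>
<field name="labels_search" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="labels" dest="labels_search"/>

with every stored value of "labels" hopefully coming back as a list per
document whenever the field is included in fl.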

If it is not possible, could someone elaborate why that is so that I may see if 
I can make it work.

thanks,

Jason


Re: How to extend IndexSchema and SchemaField

2010-09-10 Thread Lance Norskog
How about this:


value


It generally would be better to keep the attribute space clean and
make it very clear you are doing something unique to this field.
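
I.e. a child element rather than yet another attribute, roughly along these
lines (the names are just placeholders, and your extended IndexSchema would
have to parse it - stock Solr won't):

<field name="myfield" type="text" indexed="true" stored="true">
  <customInfo>value</customInfo>
</field>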

On Fri, Sep 10, 2010 at 9:16 AM, Renaud Delbru  wrote:
>  Hi Charlie,
>
> On 10/09/10 16:11, Charlie Jackson wrote:
>>
>> Have you already explored the idea of using a custom analyzer for your
>> field? Depending on your use case, that might work for you.
>
> Yes, I have thought of that, or even extending the field type. But this does not
> work for my use case, since I can have multiple fields of the same type
> (therefore with the same field type, and same analyzer), but each one of
> them needs specific information. Therefore, I think the only "nice" way to
> achieve this is to have the possibility to add attributes to any field
> definition.
>
> cheers
> --
> Renaud Delbru
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr and jvm Garbage Collection tuning

2010-09-10 Thread Kent Fitch
Hi Tom,

For what it is worth,  behind Trove (http://trove.nla.gov.au/) are 3
SOLR-managed indices and 1 Lucene index. None of ours is as big as one
of your shards, and one of our SOLR-managed indices is tiny, but your
experiences with long GC pauses are familiar to us.

One of the most difficult indices to tune is our bibliographic index
of around 38M mostly metadata records which is around 125GB and 97MB
tii files.

We need to commit updates and reopen the index every 90 seconds, and
the facet recalculation (using UnInverted) was taking quite a lot of
time, and seemed to generate lots of objects to be collected on each
reopening.

Although we've been through several rounds of tuning which have seemed
to work, at least temporarily, a few months ago we started getting 12
sec "full gc" times every 90 secs, which was no good!

We've noticed/did three things:

1) optimise to 1 segment - we'd got to the stage where 50% of the
documents had been updated (hence deleted), and the maxdocid was 50%
bigger than it needed to be, and hence data structures whose size was
proportional to maxdocid had increased a lot.  Optimising to 1 segment
greatly reduced full GC frequency and times.
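
For what it's worth, the optimise itself is just the standard XML update
command, something like this (URL is illustrative):

curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
     --data-binary '<optimize waitSearcher="true"/>'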

2) for most of our facets, forcing the facets to be filters rather
than uninverted happened to work better - but this depends on many
factors, and certainly isn't a cure-all for all facets - uninverted
often works much better than filters!
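
In practice that just means setting facet.method per request or per field,
e.g. (the field name here is only an example):

&facet=true&facet.field=format&f.format.facet.method=enum

facet.method=enum builds the counts from filters via the filterCache, while
the default facet.method=fc uses the UnInverted/field-cache approach.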

3) after lots of benchmarking real updates and queries on a dev
system, we came up with this set of JVM parameters that worked "best"
for our environment (at the moment!):

-Xmx17000M -XX:NewSize=3500M -XX:SurvivorRatio=3
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
-XX:+CMSIncrementalMode

I can't say exactly why, except that with this combination of
parameters and our data, a much bigger newgen led to less movement of
objects to oldgen, and non-full-GC collections on oldgen worked much
better.  Currently we are seeing less than 10 Full GC's a day, and
they almost always take less than 4 seconds.
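
For completeness, these are just plain JVM options passed to Tomcat at
startup, roughly like this if you start it via catalina.sh (the GC logging
flags are only what we add to watch pause times, and paths are illustrative):

CATALINA_OPTS="-Xmx17000M -XX:NewSize=3500M -XX:SurvivorRatio=3 \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/solr/gc.log"
export CATALINA_OPTS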

This index is running on an 8 core X5570 machine with 64GB, sharing it
with a large/busy mysql instance and the Trove web server.

One of our other indices is only updated once per day, but is larger:
33.5M docs representing full text of archived web pages, 246GB, tii
file is 36MB.

JVM parms are  -Xmx1M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC.

It also does less than 10 Full GC's per day, taking less than 5 sec each.

Our other large index, newspapers, is a native Lucene index, about
180GB with comparatively large tii of 280MB (probably for the same
reason your tii is large - the contents of this database is mostly
OCR'ed text).  This index is updated/reopened every 3 minutes (to
incorporate OCR text corrections and tagging) and we use a bitmap to
represent all facet values, which typically take 5 secs to rebuild on
each reopen.

JVM parms: -mx15000M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

Although this JVM usually does fewer than 5 GC's per day, these Full
GC's often take 20-30 seconds, and we need to test increasing the
Newsize on this JVM to see if we can reduce these pauses.

The web archive and newspaper index are running on 8 core X5570
machine with 72GB.

We are also running a separate copy/version of this index behind the
site  http://newspapers.nla.gov.au/ - the main difference is that the
Trove version uses shingling (inspired by the Hathi Trust results) to
improve searches containing common words.  This other version is
running on a machine with 32GB and 8 X5460 cores and  has JVM parms:
  -mx11500M  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC


Apart from the old newspapers index, all other SOLR/lucene indices are
maintained on SSDs (Intel x25m 160GB), which whilst not having
anything to do with GCs, work very very well - we couldn't cope with
our current query volumes on rotating disk without spending a great
deal of money.  The old newspaper index is running on a SAN with 24
fast disks backing it, and we can't support the same query rate on it
as we can with the other newspaper index on SSDs (even before the
shingling change).

Kent Fitch
Trove development team
National Library of Australia