counter field

2012-04-05 Thread Manish Bafna
>
> Hi,
> Is it possible to define a field as a "Counter Column" which can be
> auto-incremented?
>
> Thanks,
> Manish.
>


Re: alt attribute img tag

2012-04-05 Thread Marcelo Carvalho Fernandes
Hi Manuel,

Why don't you create a program to parse the html files, maybe using xslt,
and then submit the output to Solr?
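
A minimal sketch of such a pre-parser, assuming the jsoup HTML parser (an
XSLT step would work just as well; the file name and the println output are
placeholders for whatever feeds your Solr client):

    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class AltTextExtractor {
        public static void main(String[] args) throws IOException {
            // Parse a local HTML file (hypothetical name).
            Document doc = Jsoup.parse(new File("page.html"), "UTF-8");
            // Collect the alt text of every image; each value could then be
            // posted to Solr as a (multivalued) field.
            for (Element img : doc.select("img[alt]")) {
                System.out.println(img.attr("alt"));
            }
        }
    }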

---
Marcelo

On Thursday, April 5, 2012, Manuel Antonio Novoa Proenza <
mano...@estudiantes.uci.cu> wrote:
> Hello,
>
> I would like to know how to extract the alt attribute data from the images
in HTML documents.
> 10th ANNIVERSARY OF THE CREATION OF THE UNIVERSIDAD DE LAS CIENCIAS
INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci

-- 

Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786


Re: query time customized boosting

2012-04-05 Thread Monmohan Singh
The problem is how to determine the degree of separation for each document
and then apply boosting. For example:
say there is a user A with friends X, Y, Z and another user B with
friends L, M.
If there is a doc D1 in the index with its author field set to Z, and
another doc D2 with author L, it will be difficult to store degree of
separation per user as an index field on docs D1 and D2, because for user A,
D1 has the higher relevance, but for user B it is D2.
Is there a way to write our own elevation component in Solr?
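
One query-time option, building on the bq idea quoted below, is to rebuild
the boost query per request from the searching user's friend list, so nothing
user-specific has to live in the index. A minimal SolrJ sketch, where the
friend names, the author field and the boost factor are hypothetical (and the
friend list is assumed non-empty):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;

    public class FriendBoost {
        public static SolrQuery friendBoosted(String userQuery, List<String> friends) {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("defType", "dismax");
            // Docs authored by a friend get boosted; others still match normally.
            q.set("bq", "author:(" + String.join(" OR ", friends) + ")^10");
            return q;
        }

        public static void main(String[] args) {
            System.out.println(friendBoosted("some search terms",
                    Arrays.asList("X", "Y", "Z")));
        }
    }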

On Thu, Apr 5, 2012 at 12:29 PM, William Bell  wrote:

> If you have a degree-of-separation field (like friend), you could do
> something like:
>
> ...defType=dismax&bq=degree_of_separation:1^100
>
> Thanks.
>
> On Thu, Apr 5, 2012 at 12:55 AM, Monmohan Singh 
> wrote:
> > Hi,
> > Any inputs or experience that others have come across will be really
> > helpful to know.
> > Basically, it's the same as page ranking, but the information used to
> > decide the rank is much more dynamic in nature.
> > Appreciate any inputs.
> > Regards
> > Monmohan
> >
> > On Wed, Apr 4, 2012 at 4:22 PM, monmohan  wrote:
> >
> >> Hi,
> >> My index is composed of documents with an "author" field. My system is a
> >> users portal where they can have a friend relationship among each other.
> >> When a user searches for documents, I would like to boost score of docs
> in
> >> which  author is friend of the user doing the search. Note that the
> list of
> >> friends for a user can be potentially big and dynamic (changing as the
> user
> >> makes more friends)
> >>
> >> Is there a way to do this kind of boosting at query time? I have looked
> >> at External field, query elevator and function queries, but it appears
> >> that none of them fit this use case.
> >>
> >> Since the list of friends for a user is dynamic and per user based, it
> >> can't
> >> really be added as a field in the index for each document so I am not
> >> considering that option at all.
> >> Regards
> >> Monmohan
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/query-time-customized-boosting-tp3883743p3883743.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


An endless loop in the new SolrCloud, probably

2012-04-05 Thread Jam Luo
Hi,
I deployed a Solr cluster; the code version is
"NightlyBuilds apache-solr-4.0-2012-03-19_09-25-37".
The cluster has 4 nodes named "A", "B", "C", "D", with "num_shards=2": A and
C are in shard1, B and D are in shard2, and A and B are the leaders of their
shards. It ran for 2 days and added 20M docs, all of them OK, but after this
C hit the exception "org.apache.lucene.store.AlreadyClosedException: this
IndexWriter is closed". Jetty on C was not down, and node C still existed in
ZooKeeper under "/live_nodes". At this point A tried to ask C to recover,
but C could not respond, so A got an exception. The log is:

INFO: try and ask http://node23:8983/solr to recover
Apr 04, 2012 8:02:36 PM
org.apache.solr.update.processor.DistributedUpdateProcessor doFinish
INFO: try and ask http://node23:8983/solr to recover
Apr 04, 2012 8:02:36 PM
org.apache.solr.update.processor.DistributedUpdateProcessor doFinish
INFO: Could not tell a replica to recover
org.apache.solr.client.solrj.SolrServerException: http://node23:8983/solr
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:496)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:251)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:347)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:816)
at
org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:176)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1549)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:441)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:262)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:499)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
at org.eclipse.jetty.server.Server.handle(Server.java:351)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:952)
at
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method

Re: Solr: Highlighting word parts in excerpt does not work

2012-04-05 Thread Koji Sekiguchi

(12/04/05 15:34), Thomas Werthmüller wrote:

Hi

I configured Solr so that word parts are also found. When I search "Monday"
or "Mond", the right document is found. This is done with the following
configuration in the schema.xml.

Now, when I add hl=true to the query string, the excerpt for "Monday" looks
good and the word is highlighted. When I search only with "Mond", the
document is found but no excerpt is returned, because the query string is
not the whole word.

I hope someone can give me a hint so that excerpts are also returned for
word parts.

Thanks!
Thomas


Hi Thomas,

Highlighter doesn't support N-gram fields, I think. (Or does it support N-gram
fields recently?) FastVectorHighlighter does support such fields, but only
with a fixed gram size, e.g. minGramSize="3" maxGramSize="3".
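
As a hedged example: with the n-gram field indexed with termVectors="true",
termPositions="true" and termOffsets="true" (and a fixed gram size as above),
FastVectorHighlighter can be switched on per request (the parameter is
available from Solr 3.1; the field name "content" is a placeholder):

    ...&q=Mond&hl=true&hl.fl=content&hl.useFastVectorHighlighter=true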

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/


Large numbers of executeWithRetry INFO messages

2012-04-05 Thread Shubham Srivastava
Hi,

I am getting the logs below:

Apr 5, 2012 6:27:59 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) 
caught when processing request: The server 192.168.6.135 failed to respond
Apr 5, 2012 6:27:59 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: Retrying request
Apr 5, 2012 6:28:39 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) 
caught when processing request: The server 192.168.6.135 failed to respond
Apr 5, 2012 6:28:39 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: Retrying request
Apr 5, 2012 6:30:39 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) 
caught when processing request: The server 192.168.6.135 failed to respond
Apr 5, 2012 6:30:39 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: Retrying request
Apr 5, 2012 6:31:59 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) 
caught when processing request: The server 192.168.6.135 failed to respond
Apr 5, 2012 6:31:59 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: Retrying request
Apr 5, 2012 6:32:59 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry
INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) 
caught when processing request: The server 192.168.6.135 failed to respond
Apr 5, 2012 6:32:59 PM org.apache.commons.httpclient.HttpMethodDirector 
executeWithRetry

every now and then, on every slave, randomly. However, I haven't seen any 
issues with Master-Slave replication as such, validated with the index 
version and generation numbers as well as the data.

I am using Solr 3.5 with 5 slaves + 1 master. The polling interval is 20 
seconds, and docs are updated (delta-import) every 60 seconds through the 
master. The slaves are only for reads.

I am running Solr with Tomcat 6.0.35, and below are the connection settings:



Heap size is 1 GB (Xms=Xmx=1024m).

Any pointers on what could be wrong?

Regards,
Shubham



Re: SolrCloud replica and leader out of Sync somehow

2012-04-05 Thread Yonik Seeley
On Thu, Apr 5, 2012 at 12:19 AM, Jamie Johnson  wrote:
> Not sure if this got lost in the shuffle, were there any thoughts on this?

Sorting by "id" could be pretty expensive (memory-wise), so I don't
think it should be default or anything.
We also need a way for a client to hit the same set of servers again
anyway (to handle other possible variations like commit time).

To handle the tiebreak stuff, you could also sort by _version_ - that
should be unique in an index and is already used under the covers, and
hence shouldn't add any extra memory overhead.  Versions increase over
time, so "_version_ desc" should give you newer documents first.
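
For example (a sketch; _version_ has to exist in the schema, as it does on
trunk):

    ...&q=*:*&sort=score desc,_version_ desc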

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10




> On Wed, Mar 21, 2012 at 11:02 AM, Jamie Johnson  wrote:
>> Given that in a distributed environment the docids are not guaranteed
>> to be the same across shards should the sorting use the uniqueId field
>> as the tie breaker by default?
>>
>> On Tue, Mar 20, 2012 at 2:10 PM, Yonik Seeley
>>  wrote:
>>> On Tue, Mar 20, 2012 at 2:02 PM, Jamie Johnson  wrote:
 I'll try to dig for the JIRA.  Also I'm assuming this could happen on
 any sort, not just score correct?  Meaning if we sorted by a date
 field and there were duplicates in that date field order wouldn't be
 guaranteed for the same reasons right?
>>>
>>> Correct - internal docid is the tiebreaker for all sorts.
>>>
>>> -Yonik
>>> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
>>> Boston May 7-10


Re: Search for "library" returns 0 results, but search for "marion library" returns many results

2012-04-05 Thread Erik Hatcher
> It looks like somehow the query is getting converted from "library" to
> "librari". Any idea how that would happen?

Yeah, that happens from having stemming involved in your query-time analysis 
(look at your field type; you've surely got Snowball in there).

Also, you're using the dismax query parser, which has many knobs and dials, and 
this is why things aren't matching as you'd expect.  You'll want to tinker with 
some of those settings, especially if you need to query multiple fields with 
varying weights.

Erik



On Apr 4, 2012, at 12:11 , Sean Adams-Hiett wrote:

> Here are some of the XML results with the debug on (the XML tags were
> stripped in the archive; the recoverable values are):
>
> rawquerystring: library
> querystring: library
> parsedquery: +DisjunctionMaxQuery((content:librari)~0.01)
> DisjunctionMaxQuery((content:librari^2.0)~0.01)
> parsedquery_toString: +(content:librari)~0.01 (content:librari^2.0)~0.01
> QParser: DisMaxQParser
> (timing section omitted; every entry was 0.0)
>
> It looks like somehow the query is getting converted from "library" to
> "librari". Any idea how that would happen?
> 
> Sean
> 
> On Wed, Apr 4, 2012 at 10:13 AM, Ravish Bhagdev 
> wrote:
> 
>> Yes, can you check if results you get with "marion library" match on marion
>> or library?  By default solr uses OR between words (specified in
>> solrconfig.xml).  You can also easily check this by enabling highlighting.
>> 
>> Ravish
>> 
>> On Wed, Apr 4, 2012 at 4:11 PM, Joshua Sumali  wrote:
>> 
>>> Did you try to append &debugQuery=on to get more information?
>>> 
 -Original Message-
 From: Sean Adams-Hiett [mailto:s...@advantage-companies.com]
 Sent: Wednesday, April 04, 2012 10:43 AM
 To: solr-user@lucene.apache.org
 Subject: Search for "library" returns 0 results, but search for "marion
>>> library"
 returns many results
 
 This is cross posted on Drupal.org: http://drupal.org/node/1515046
 
 Summary: I have a fairly clean install of Drupal 7 with
>>> Apachesolr-1.0-beta18. I
 have created a content type called document with a number of fields. I
>> am
 working with 30k+ records, most of which are related to "Marion, IA" in
>>> some
 way. A search for "library" (without the quotes) returns no results,
>>> while a
 search for "marion library" returns thousands of results. That doesn't
>>> make
 any sense to me at all.
 
 Details:
 
  Drupal 7 (latest stable version)
  Apachesolr-1.0-beta18
  Custom content type with many fields
  LAMP stack running on Centos Linode
  PHP 5.2.x
 
 
 I also checked this through the solr admin interface, running the same
 searches with similar results, so I can't rule out the possibility that
>>> something
 is configured wrong... but since I am using the solrconfig.xml and
>>> schema.xml
 files provided with the modules, it is also a possibility that the
>> issue
>>> lies here
 as well. I have watched the logs and during the searches that produce
>> no
 results but should, there is no output in the log besides the regular
 [INFO] about the query.
 
 I am stumped and I am past a deadline with this project, so any help
>>> would
 be greatly appreciated.
 
 --
 Sean Adams-Hiett
 Director of Development
 The Advantage Companies
 s...@advantage-companies.com
 www.advantage-companies.com
>>> 
>> 
> 
> 
> 
> -- 
> Sean Adams-Hiett
> Owner, Web Geeks For Hire
> phone: (361) 433.5748
> email: s...@webgeeksforhire.com
> twitter: @geekbusiness 



Re: Search for "library" returns 0 results, but search for "marion library" returns many results

2012-04-05 Thread Sean Adams-Hiett
Thanks for all the replies on this. It turns out that the reason I
wasn't getting the expected results is that one of the fields was not
properly indexed. My content type display settings for that field were set
to hidden in Drupal. After I corrected this and re-indexed, I started
getting the expected results.

Thanks again for all the responses!

Sean

On Thu, Apr 5, 2012 at 10:02 AM, Erik Hatcher wrote:

> > It looks like somehow the query is getting converted from "library" to
> > "librari". Any idea how that would happen?
>
> Yeah, that happens from having stemming involved in your query-time
> analysis (look at your field type; you've surely got Snowball in there).
>
> Also, you're using the dismax query parser, which has many knobs and dials,
> and this is why things aren't matching as you'd expect.  You'll want to
> tinker with some of those settings, especially if you need to query multiple
> fields with varying weights.
>
>Erik


-- 
Sean Adams-Hiett
Owner, Web Geeks For Hire
phone: (361) 433.5748
email: s...@webgeeksforhire.com
twitter: @geekbusiness 


JSP support not configured

2012-04-05 Thread Joseph Werner
I worked through the Solr tutorial and everything worked like a charm, so
I figured I would go ahead and install Jetty and try to install Solr
and get a functional prototype search engine up.  Unfortunately, my
Jetty installation seems to be broken:


HTTP ERROR 500

Problem accessing /solr/admin/index.jsp. Reason:

JSP support not configured


Any suggestions here?

--
Best Regards,
[Joseph] Christian Werner Sr


Re: counter field

2012-04-05 Thread Chris Hostetter

: > Is it possible to define a field as "Counter Column" which can be
: > auto-incremented.

a feature like this does not exist in Solr at the moment, but it would be 
possible to implement this fairly easily in an UpdateProcessor -- however 
it would only be functional in very limited situations (ie: all docs must 
use the same update chain, single node indexes only -- no distributed 
search / solr cloud)
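
For illustration only, a minimal sketch of such an UpdateProcessor, with the
same caveats: the counter is a per-JVM AtomicLong, so it is not persisted
across restarts and is not distributed-safe, and the field name "counter" is
made up:

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class CounterFieldProcessor extends UpdateRequestProcessor {
      private static final AtomicLong COUNTER = new AtomicLong();

      public CounterFieldProcessor(UpdateRequestProcessor next) {
        super(next);
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        // Only assign a value when the client did not supply one, so updates
        // that resend the old number keep it.
        if (cmd.getSolrInputDocument().getFieldValue("counter") == null) {
          cmd.getSolrInputDocument().setField("counter", COUNTER.incrementAndGet());
        }
        super.processAdd(cmd);
      }
    }

Wiring it in would still take a corresponding UpdateRequestProcessorFactory
and an entry in an updateRequestProcessorChain in solrconfig.xml.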

a better question is: why?  what do you want to do with a field like this?

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341




-Hoss


Re: custom field default qf of requestHandler

2012-04-05 Thread Chris Hostetter
:   
: 
: 
:   
: 
:
:  

i'm pretty sure what you are seeing here is a variation on the "stopwords" 
confusion people tend to have about dismax (and edismax)

just like the lucene qparser, "whitespace" in the query string is 
significant, and is used to denote the individual clauses of the input, 
which are then *individually* passed to the analysers for each field in 
the qf -- if one of your qf fields produces no tokens for an individual 
clause (in this case: because it is configured not to output unigrams, and 
unigrams are all that it can produce based on only getting one clause at a 
time) then it gets dropped out...

http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/

(note in particular the latter half starting with "Where people tend to 
get tripped up, is in thinking about how Solr's per-field analysis 
configuration...") 

if you quoted some portion of the input, then the entire quoted portion 
would be treated as a single clause and passed to your analyser.

alternately: if you used that field in the "pf" (where the entire input is 
treated as one phrase) you would also start to see some shingles i believe


-Hoss

upgrade solr from 1.4 to 3.5 not working

2012-04-05 Thread Robert Petersen
Hi folks, I'm a little stumped here.

 

I have an existing Solr 1.4 setup which is well configured.  I want to
upgrade to the latest Solr release, and after reading release notes, the
wiki, etc., I concluded the correct path would be to not change any
config items and just replace the solr.war file in Tomcat's webapps
folder with the new one and then start Tomcat back up.

 

This worked fine, solr came up.  The problem is that on the solr info
page it still says that I am running solr 1.4 even after several
restarts and even a server reboot.  Am I missing something?  Info says
this though there is no solr 1.4 war file anywhere under tomcat root:

 

Solr Specification Version: 1.4.0.2009.12.10.10.34.34
Solr Implementation Version: 1.4 exported - sam - 2009-12-10 10:34:34
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 exported - 2009-12-10 10:32:14
Current Time: Thu Apr 05 12:56:12 PDT 2012
Server Start Time: Thu Apr 05 12:52:25 PDT 2012

 

Any help would be appreciated.

Thanks

Robi



Re: How to return a result with multiple query?

2012-04-05 Thread Erick Erickson
the default query operator is pretty much ignored with (e)dismax
style parsers. You can get there by varying the "mm" parameter.

See: 
http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29

Best
Erick
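
For example, to require every clause to match with (e)dismax (a sketch
against the query quoted below):

    ...&defType=edismax&q=sequence:N sequence:N sequence:G&mm=100%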

On Tue, Apr 3, 2012 at 10:58 PM, neosky  wrote:
> 1. I did 5-gram tokens on my sequence field, and I search as the following:
>
> http://192.168.52.137:8983/solr/select?indent=on&defType=dismax&version=2.2&q=sequence:N%20sequence:N%20sequence:G&fq=&start=0&rows=10&fl=*,score&qt=&wt=&explainOther=&hl=on&hl.fl=sequence
>
> I want to return a document with "N" & "N" & "G" at the same
> time (the first and the second queries are the same).
> Most cases work well, except one document without "G" is also returned.
> 2. I can't remember where to define that the operator behind "q" is "AND"
> rather than "OR". I modified it somewhere before, but I can't remember
> where now. I use "defType=edismax" here.
>
> 3. I also tried defType=dismax, but it returned no results even when I just
> queried one parameter:
> http://192.168.52.137:8983/solr/select?indent=on&version=2.2&defType=dismax&q=N&fq=&start=0&rows=10&fl=*,score&qt=&wt=&explainOther=&hl=on&hl.fl=sequence
> It should return many candidates. Why doesn't dismax work with N-grams?
>
> I am a little overwhelmed. Thanks in advance.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-return-a-result-with-multiple-query-tp3883049p3883049.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Choosing tokenizer based on language of document

2012-04-05 Thread Erick Erickson
This is really difficult to imagine working well. Even if you
do choose the appropriate analysis chain (and it must
be a chain here), and manage to appropriately tokenize
for each language, what happens at query time?

How do you expect to get matches on, say, Ukrainian when
the tokens of the query are in Erse?

This feels like an XY problem, can you explain at a
higher level what your requirements are?

Best
Erick

On Wed, Apr 4, 2012 at 8:29 AM, Prakashganesh, Prabhu
 wrote:
> Hi,
>      I have documents in different languages and I want to choose the 
> tokenizer to use for a document based on the language of the document. The 
> language of the document is already known and is indexed in a field. What I 
> want to do is when I index the text in the document, I want to choose the 
> tokenizer to use based on the value of the language field. I want to use one 
> field for the text in the document (defining multiple fields for each 
> language is not an option). It seems like I can define a tokenizer for a 
> field, so I guess what I need to do is to write a custom tokenizer that looks 
> at the language field value of the document and calls the appropriate 
> tokenizer for that language (e.g. StandardTokenizer for English, CJKTokenizer 
> for CJK languages etc..). From whatever I have read, it seems quite straight 
> forward to write a custom tokenizer, but how would this custom tokenizer know 
> the language of the document? Is there some way I can pass in this value to 
> the tokenizer? Or is there some way the tokenizer will have access to other 
> fields in the document?. Would be really helpful if someone can provide an 
> answer
>
> Thanks
> Prabhu


Re: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)

2012-04-05 Thread Erick Erickson
Solr version? I suspect your outlier is due to merging
segments; if so, this should have happened quite some time
into the run. See Simon Willnauer's blog post on the
DocumentsWriterPerThread (trunk) code.

What commitWithin time are you using?


Best
Erick

On Wed, Apr 4, 2012 at 7:50 PM, Mike O'Leary  wrote:
> I am indexing some database contents using add(docs, commitWithinMs), and 
> those add calls are taking over 80% of the time once the database begins 
> returning results. I was wondering if setting waitSearcher to false would 
> speed this up. Many of the calls take 1 to 6 seconds, with one outlier that 
> took over 11 minutes.
> Thanks,
> Mike
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Wednesday, April 04, 2012 4:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, 
> commitWithinMs)
>
>
> On Apr 4, 2012, at 6:50 PM, Mike O'Leary wrote:
>
>> If you index a set of documents with SolrJ and use
>> StreamingUpdateSolrServer.add(Collection docs, int
>> commitWithinMs), it will perform a commit within the time specified, and it 
>> seems to use default values for waitFlush and waitSearcher.
>>
>> Is there a place where you can specify different values for waitFlush
>> and waitSearcher, or if you want to use different values do you have
>> to call StreamingUpdateSolrServer.add(Collection
>> docs) and then call StreamingUpdateSolrServer.commit(waitFlush, 
>> waitSearcher) explicitly?
>> Thanks,
>> Mike
>
>
> waitFlush actually does nothing in recent versions of Solr. waitSearcher 
> doesn't seem so important when the commit is not done explicitly by the user 
> or a client.
>
> - Mark Miller
> lucidimagination.com


Re: Is there any performance cost of using lots of OR in the solr query

2012-04-05 Thread Erick Erickson
Of course putting more clauses in an OR query will
have a performance cost; there's more work to do.

OK, smart-aleck remarks aside, you will probably
be fine with a few hundred clauses. The question
is simply whether the performance hit is acceptable.
I'm afraid that question can't be answered in the
abstract; you'll have to test...

Since you're putting them in an fq, there's also some chance
that they'll be re-used from the cache, at least if there
are common patterns.

Best
Erick
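
For what it's worth, a minimal SolrJ sketch of the second call (field names
are from the original post; the id list is hypothetical):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;

    public class SkuQuery {
      public static void main(String[] args) {
        List<String> ids = Arrays.asList("34", "45", "56", "77");
        SolrQuery q = new SolrQuery("document_type:SKU");
        // One big OR clause in a filter query; the filter cache only helps
        // if the exact same id list repeats across requests.
        q.addFilterQuery("product_id:(" + String.join(" OR ", ids) + ")");
        System.out.println(q);
      }
    }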

On Wed, Apr 4, 2012 at 8:05 PM, roz dev  wrote:
> Hi All,
>
> I am working on an application which makes few solr calls to get the data.
>
> On the high level, We have a requirement like this
>
>
>   - Make first call to Solr, to get the list of products which are
>   children of a given category
>   - Make 2nd solr call to get product documents based on a list of product
>   ids
>
> 2nd query will look like
>
> q=document_type:SKU&fq=product_id:(34 OR 45 OR 56 OR 77)
>
> We can have close to 100 product ids in fq.
>
> is there a performance cost of doing these solr calls which have lots of OR?
>
> As per Slide # 41 of Presentation "The Seven Deadly Sins of Solr", it is a
> bad idea to have these kind of queries.
>
> http://www.slideshare.net/lucenerevolution/hill-jay-7-sins-of-solrpdf
>
> But, It does not become clear the reason it is bad.
>
> Any inputs will be welcome.
>
> Thanks
>
> Saroj


Re: It costs so much memory with solrj 3.5 & how to decrease it?

2012-04-05 Thread Erick Erickson
"What's memory"? Really, how are you measuring it?

If it's virtual, you don't need to worry about it. Is this
causing you a real problem or are you just nervous about
the difference?

Best
Erick

On Wed, Apr 4, 2012 at 11:23 PM, a sd  wrote:
> Hi, all.
> I have written a program which sends data to Solr using the "update"
> request handler. When I used the server & client library (namely solrj)
> version 4.0 or 3.2, the JVM's heap size was up to about 1.0 GB, but when I
> moved all of them to Solr 3.5 (both server and client libs), the heap
> size went up to 3.0 GB! It is the same server configuration and
> the identical program. What's wrong with the new version of solrj 3.5? I
> have looked at the source code, and there is no difference between solrj
> 3.2 and solrj 3.5 in the code my program may invoke. What can I do to
> decrease the memory used by solrj 3.5?
> Any advice will be appreciated!
> murphy


How to store the secondary results from the Solr

2012-04-05 Thread neosky
Because the first query result doesn't meet my requirement,
I have to do a secondary pass manually over the full results of the first
query.
Only after I finish the secondary pass do I begin to show results to the
end user, a specific number of records at a time (for instance, 10 records
at a time, like Solr does).
One option is to write them to a temporary file and then read it again;
another option is to keep them in memory and output them when the user
needs them.
If I want to keep them in memory, what kind of data structure can I use to
store them?
Or if you have experience with this, would you share it with me?
Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-store-the-secondary-results-from-the-Solr-tp324p324.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)

2012-04-05 Thread Mike O'Leary
First of all, what I was seeing was different from what I thought I was seeing, 
because a few weeks ago I uncommented the autoCommit block in the 
solrconfig.xml file and didn't realize it until yesterday just before I went 
home, so that was controlling the commits more than the add and commit calls 
that I was making. When I commented that block out again, the times for 
indexing with add(docs, commitWithinMs) and with add(docs) plus 
commit(false, false) were very similar. Both of them were about 20 minutes 
faster (38 minutes instead of about an hour) than indexing with autoCommit set 
to commit after every 1,000 documents or fifteen minutes.
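
For reference, the two client-side variants compared above look roughly like
this (a sketch against solrj 3.5; the URL and the queue/thread settings are
placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexRun {
      public static void main(String[] args) throws Exception {
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        // ... docs populated from the database ...

        // Variant 1: ask the server to commit within 60s of the add.
        server.add(docs, 60000);

        // Variant 2: add, then commit explicitly; (false, false) skips
        // waiting on the flush and on the new searcher.
        server.add(docs);
        server.commit(false, false);
      }
    }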

Is this the blog post you are talking about: 
http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/?
 It seems to be about the right topic.

I am using Solr 3.5. The feature matrix on one of the Lucid Imagination web 
pages says that DocumentWriterPerThread is available in Solr 4.0 and LucidWorks 
2.0. I assume that means LucidWorks Enterprise. Is that right?
Thanks,
Mike

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, April 05, 2012 2:45 PM
To: solr-user@lucene.apache.org
Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, 
commitWithinMs)

Solr version? I suspect your outlier is due to merging segments; if so, this 
should have happened quite some time into the run. See Simon Willnauer's blog 
post on the DocumentsWriterPerThread (trunk) code.

What commitWithin time are you using?


Best
Erick

On Wed, Apr 4, 2012 at 7:50 PM, Mike O'Leary  wrote:
> I am indexing some database contents using add(docs, commitWithinMs), and 
> those add calls are taking over 80% of the time once the database begins 
> returning results. I was wondering if setting waitSearcher to false would 
> speed this up. Many of the calls take 1 to 6 seconds, with one outlier that 
> took over 11 minutes.
> Thanks,
> Mike
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Wednesday, April 04, 2012 4:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, 
> commitWithinMs)
>
>
> On Apr 4, 2012, at 6:50 PM, Mike O'Leary wrote:
>
>> If you index a set of documents with SolrJ and use 
>> StreamingUpdateSolrServer.add(Collection docs, int 
>> commitWithinMs), it will perform a commit within the time specified, and it 
>> seems to use default values for waitFlush and waitSearcher.
>>
>> Is there a place where you can specify different values for waitFlush 
>> and waitSearcher, or if you want to use different values do you have 
>> to call StreamingUpdateSolrServer.add(Collection
>> docs) and then call StreamingUpdateSolrServer.commit(waitFlush, 
>> waitSearcher) explicitly?
>> Thanks,
>> Mike
>
>
> waitFlush actually does nothing in recent versions of Solr. waitSearcher 
> doesn't seem so important when the commit is not done explicitly by the user 
> or a client.
>
> - Mark Miller
> lucidimagination.com


EmbeddedSolrServer and StreamingUpdateSolrServer

2012-04-05 Thread pcrao
Hi,

I am using EmbeddedSolrServer for full indexing (Multi core)
and StreamingUpdateSolrServer for incremental indexing. 
The steps involved are mentioned below.

Full indexing (Daily)
1) Start EmbeddedSolrServer 
2) Delete all docs
3) Add all docs
4) Commit and optimize collection
5) Stop EmbeddedSolrServer 
6) Reload core
http://localhost:7070/solr/admin/cores?action=RELOAD&core=docs

Incremental Indexing (Hourly)
1) Start StreamingUpdateSolrServer 
2) Add/Delete docs
3) Commit collection

Now, the issue is that the index gets corrupted if we do full indexing
and incremental indexing one after the other without restarting
the Tomcat web server (localhost:7070). There is no issue if we
restart Tomcat after each of the indexing processes (full and
incremental).

Please let me know how we can avoid corrupting the index without restarting
Tomcat. I am fairly new to Solr, so I may be missing something here.

Below are some details about our Solr Installation.
1) JVM: OpenJDK 64-Bit Server VM (19.0-b09)
2) solr-spec-version: 4.0.0.2011.12.08.06.33.52
3) solr-impl-version: 4.0-SNAPSHOT 1211898 - root - 2011-12-08 06:33:52
4) lucene-spec-version: 4.0-SNAPSHOT
5) lucene-impl-version: 4.0-SNAPSHOT 1211898 - root - 2011-12-08 06:24:12
6) OS: Red Hat Enterprise Linux Server release 6.1 (Santiago)

Thanks,
PC Rao.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-and-StreamingUpdateSolrServer-tp3889073p3889073.html
Sent from the Solr - User mailing list archive at Nabble.com.


schema design question

2012-04-05 Thread N. Tucker
Apologies if this is a very straightforward schema design problem that
should be fairly obvious, but I'm not seeing a good way to do it.
Let's say I have an index that wants to model Albums and Tracks, and
they all have arbitrary tags attached to them (represented by
multivalue string type fields).  Tracks also have an album id field
which can be used to associate them with an album.  I'd like to
perform a query which shows both Track and Album results, but
suppresses Tracks that are associated with Albums in the result set.

I am tempted to use a "join" here, but I have reservations because it
is my understanding that joins cannot work across shards, and I'm not
sure it's a good idea to limit myself in that way if possible.  Any
suggestions?  Is there a standard solution to this type of problem
where you've got hierarchical items and you don't want children shown
in the same result as the parent?
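
If the join route is taken anyway, the query-time syntax on trunk (4.0) is
roughly the following, with field names invented here for the album/track
model: the join clause selects tracks whose parent album matches, and
negating that same clause in an fq suppresses exactly those tracks while
keeping the albums. A hedged example:

    ...&q=tag:jazz&fq=-_query_:"{!join from=id to=album_id}tag:jazz"

(The _query_ hook is just the standard way to nest a different query parser
inside the lucene parser.) This still leaves the cross-shard limitation the
post mentions.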


A little confusion with maxPosAsterisk

2012-04-05 Thread neosky
maxPosAsterisk - maximum position (1-based) of the asterisk wildcard ('*')
that triggers the reversal of query term. Asterisk that occurs at positions
higher than this value will not cause the reversal of query term. Defaults
to 2, meaning that asterisks on positions 1 and 2 will cause a reversal.

I can't understand "will cause a reversal."
I know Solr will keep the original token and the reversed token when the
withOriginal parameter is on.
Does that mean the searcher will use the reversed one to help process the
query when it "causes a reversal"?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/A-little-onfusion-with-maxPosAsterisk-tp3889226p3889226.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: It costs so much memory with solrj 3.5 & how to decrease it?

2012-04-05 Thread a sd
Hi, Erick.
Thanks first of all.
I watched the status of the JVM at runtime with "jconsole" and "jmap".
1. When "Xmx" was not assigned, the "Old Gen" area filled up to 1.5 GB, and
its major content was instances of "String". When the whole heap reached the
maximum (about 2 GB), the JVM ran gc(), which wasted CPU time, and the
performance degraded sharply, from 100,000 docs per minute to 10,000 docs
per minute. As a further check, I assigned "Xmx=1024m" on purpose, and the
rate went down to 1,000 docs per minute.
2. When I assigned "Xmx=4096m", I found that the "Old Gen" grew to 2.1 GB
and the total size of the JVM reached 3 GB, but the performance of 100,000
docs per minute could be attained.
During all of the tests above, I only adjusted the settings of the client,
which connects to the identical Solr server, and I emptied the "data"
directory of the Solr home before every test.
By the way, I know the client code is very ugly and occupies a lot of heap
too, but I am not permitted to improve it before I obtain a benchmark using
solrj 3.5 that matches what the old version did using solrj 1.4.
B.R
murphy

On Fri, Apr 6, 2012 at 5:54 AM, Erick Erickson wrote:

> "What's memory"? Really, how are you measuring it?
>
> If it's virtual, you don't need to worry about it. Is this
> causing you a real problem or are you just nervous about
> the difference?
>
> Best
> Erick
>


Re: A tool for frequent re-indexing...

2012-04-05 Thread Ahmet Arslan
> I am considering writing a small tool that would read from
> one solr core
> and write to another as a means of quick re-indexing of
> data.  I have a
> large-ish set (hundreds of thousands) of documents that I've
> already parsed
> with Tika and I keep changing bits and pieces in schema and
> config to try
> new things often.  Instead of having to go through the
> process of
> re-indexing from docs (and some DBs), I thought it may be
> much faster
> to just read from one core and write into a new core with a new
> schema, analysers and/or settings.
> 
> I was wondering if anyone else has done anything similar
> already?  It would
> be handy if I can use this sort of thing to spin off another
> core write to
> it and then swap the two cores discarding the older one.

You might find these relevant :

https://issues.apache.org/jira/browse/SOLR-3246

http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
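
For the record, a minimal SolrJ sketch of the core-to-core copy idea itself
(it assumes every field is stored and a uniqueKey named "id"; URLs and page
size are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class CoreToCoreCopy {
      public static void main(String[] args) throws Exception {
        SolrServer src = new CommonsHttpSolrServer("http://localhost:8983/solr/old");
        SolrServer dst = new CommonsHttpSolrServer("http://localhost:8983/solr/new");
        int rows = 500;
        int start = 0;
        while (true) {
          SolrQuery q = new SolrQuery("*:*");
          q.setStart(start).setRows(rows);
          q.addSortField("id", SolrQuery.ORDER.asc); // stable paging order
          SolrDocumentList page = src.query(q).getResults();
          if (page.isEmpty()) break;
          for (SolrDocument d : page) {
            // Re-adding runs the new core's analysis over the stored values.
            dst.add(ClientUtils.toSolrInputDocument(d));
          }
          start += page.size();
        }
        dst.commit();
      }
    }

Anything not stored in the source core is lost this way, which is the main
limitation of the approach.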




Re: counter field

2012-04-05 Thread Manish Bafna
We need to have a document id available for every document (per core).
There is a DocID in the Lucene index, but we did not find any API to expose
it using Solr.

Maybe we can alter Solr to optionally return the DocId (which is unique);

we could pass docid as one of the parameters for fq, and it would return the
docid in the search result.

On Thu, Apr 5, 2012 at 10:13 PM, Chris Hostetter
wrote:

>
> : > Is it possible to define a field as "Counter Column" which can be
> : > auto-incremented.
>
> a feature like this does not exist in Solr at the moment, but it would be
> possible to implement this fairly easily in an UpdateProcessor -- however
> it would only be functional in very limited situations (ie: all docs must
> use the same update chain, single node indexes only -- no distributed
> search / solr cloud)
>
> a better question is: why?  what do you wnat to do with a field like this?
>
> https://people.apache.org/~hossman/#xyproblem
> XY Problem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
>
>
>
> -Hoss
>


Re: counter field

2012-04-05 Thread Chris Hostetter

: We need to have a document id available for every document (Per core).

: We can pass docid as one of the parameter for fq, and it will return the
: docid in the search result.


So it sounds like you need a *unique* id, but nothing you described 
requires that it be a counter.

Take a look at the UUIDField, or consider using the 
SignatureUpdateProcessor to generate a key based on a hash of all the 
field values.
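
For the first suggestion, the usual schema.xml recipe is a sketch like this
(the field name is up to you; default="NEW" makes Solr generate a value when
none is supplied):

    <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
    <field name="doc_uuid" type="uuid" indexed="true" stored="true" default="NEW"/>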

-Hoss


Re: counter field

2012-04-05 Thread Manish Bafna
We already have a unique key (we use an md5 value).
We need another id (sequential numbers).

On Fri, Apr 6, 2012 at 9:47 AM, Chris Hostetter wrote:

>
> : We need to have a document id available for every document (Per core).
>
> : We can pass docid as one of the parameter for fq, and it will return the
> : docid in the search result.
>
>
> So it sounds like you need a *unique* id, but nothing you described
> requires that it be a counter.
>
> Take a look at the UUIDField, or consider using the
> SignatureUpdateProcessor to generate a key based on a hash of all the
> field values.
>
> -Hoss
>


Re: counter field

2012-04-05 Thread Walter Underwood
Why?

When you reindex, is it OK if they all change?

If you reindex one document, is it OK if it gets a new sequential number?

wunder

On Apr 5, 2012, at 9:23 PM, Manish Bafna wrote:

> We already have a unique key (We use md5 value).
> We need another id (sequential numbers).
> 
> On Fri, Apr 6, 2012 at 9:47 AM, Chris Hostetter 
> wrote:
> 
>> 
>> : We need to have a document id available for every document (Per core).
>> 
>> : We can pass docid as one of the parameter for fq, and it will return the
>> : docid in the search result.
>> 
>> 
>> So it sounds like you need a *unique* id, but nothing you described
>> requies that it be a counter.
>> 
>> Take a look at the UUIDField, or consider using the
>> SignatureUpdateProcessor to generate a key based on a hash of all the
>> field values.
>> 
>> -Hoss
>> 







Re: counter field

2012-04-05 Thread Manish Bafna
Actually not.
If I am updating an existing document, I need to keep the old number
itself.

Maybe we can do it this way:
if we pass the number to the field, it will take that value; if we don't
pass it, it will do auto-increment.
That works because on an update I will have the old number and I will pass
it as a field again.

On Fri, Apr 6, 2012 at 9:59 AM, Walter Underwood wrote:

> Why?
>
> When you reindex, is it OK if they all change?
>
> If you reindex one document, is it OK if it gets a new sequential number?
>
> wunder
>
> On Apr 5, 2012, at 9:23 PM, Manish Bafna wrote:
>
> > We already have a unique key (We use md5 value).
> > We need another id (sequential numbers).
> >
> > On Fri, Apr 6, 2012 at 9:47 AM, Chris Hostetter <
> hossman_luc...@fucit.org>wrote:
> >
> >>
> >> : We need to have a document id available for every document (Per core).
> >>
> >> : We can pass docid as one of the parameter for fq, and it will return
> the
> >> : docid in the search result.
> >>
> >>
> >> So it sounds like you need a *unique* id, but nothing you described
> >> requies that it be a counter.
> >>
> >> Take a look at the UUIDField, or consider using the
> >> SignatureUpdateProcessor to generate a key based on a hash of all the
> >> field values.
> >>
> >> -Hoss
> >>
>
>
>
>
>
>


Re: counter field

2012-04-05 Thread Walter Underwood
So you will need to do a search for each document before adding it to the 
index, in case it is already there. That will be slow.

And where do you store the last-assigned number?

And there are plenty of other problems, like reloading after a corrupted index 
(disk failure), or deleted documents which are re-added later, or duplicates, 
splitting content across shards (requires a global lock across all shards to 
index each document), ...

Two recommendations:

1. Having two different unique IDs is likely to cause problems, so choose one.

2. If you must have two IDs, use one table in a lightweight relational database 
to store the relationships between the md5 value and the serial number.

wunder

On Apr 5, 2012, at 9:37 PM, Manish Bafna wrote:

> Actually not.
> If i am updating the existing document, i need to keep the old number
> itself.
> 
> may be this way we can do it.
> If we pass the number to the field, it will take that value, if we dont
> pass it, it will do auto-increment.
> Because if we update, i will have old number and i will pass it as a field
> again.
> 
> On Fri, Apr 6, 2012 at 9:59 AM, Walter Underwood wrote:
> 
>> Why?
>> 
>> When you reindex, is it OK if they all change?
>> 
>> If you reindex one document, is it OK if it gets a new sequential number?
>> 
>> wunder
>> 
>> On Apr 5, 2012, at 9:23 PM, Manish Bafna wrote:
>> 
>>> We already have a unique key (We use md5 value).
>>> We need another id (sequential numbers).
>>> 
>>> On Fri, Apr 6, 2012 at 9:47 AM, Chris Hostetter <
>> hossman_luc...@fucit.org>wrote:
>>> 
 
 : We need to have a document id available for every document (Per core).
 
 : We can pass docid as one of the parameter for fq, and it will return
>> the
 : docid in the search result.
 
 
 So it sounds like you need a *unique* id, but nothing you described
 requies that it be a counter.
 
 Take a look at the UUIDField, or consider using the
 SignatureUpdateProcessor to generate a key based on a hash of all the
 field values.
 
 -Hoss
 

--
Walter Underwood
wun...@wunderwood.org





Re: counter field

2012-04-05 Thread Manish Bafna
Yes, before indexing, we go and check whether that document is already
there in the index or not,
because along with the document we also have metadata information which
needs to be appended.

So, we have a few multivalued metadata fields, which we update if the same
document is found again.


On Fri, Apr 6, 2012 at 10:17 AM, Walter Underwood wrote:

> So you will need to do a search for each document before adding it to the
> index, in case it is already there. That will be slow.
>
> And where do you store the last-assigned number?
>
> And there are plenty of other problems, like reloading after a corrupted
> index (disk failure), or deleted documents which are re-added later, or
> duplicates, splitting content across shards (requires a global lock across
> all shards to index each document), ...
>
> Two recommendations:
>
> 1. Having two different unique IDs is likely to cause problems, so choose
> one.
>
> 2. If you must have two IDs, use one table in a lightweight relational
> database to store the relationships between the md5 value and the serial
> number.
>
> wunder
>
> On Apr 5, 2012, at 9:37 PM, Manish Bafna wrote:
>
> > Actually not.
> > If i am updating the existing document, i need to keep the old number
> > itself.
> >
> > may be this way we can do it.
> > If we pass the number to the field, it will take that value, if we dont
> > pass it, it will do auto-increment.
> > Because if we update, i will have old number and i will pass it as a
> field
> > again.
> >
> > On Fri, Apr 6, 2012 at 9:59 AM, Walter Underwood  >wrote:
> >
> >> Why?
> >>
> >> When you reindex, is it OK if they all change?
> >>
> >> If you reindex one document, is it OK if it gets a new sequential
> number?
> >>
> >> wunder
> >>
> >> On Apr 5, 2012, at 9:23 PM, Manish Bafna wrote:
> >>
> >>> We already have a unique key (We use md5 value).
> >>> We need another id (sequential numbers).
> >>>
> >>> On Fri, Apr 6, 2012 at 9:47 AM, Chris Hostetter <
> >> hossman_luc...@fucit.org>wrote:
> >>>
> 
>  : We need to have a document id available for every document (Per
> core).
> 
>  : We can pass docid as one of the parameter for fq, and it will return
> >> the
>  : docid in the search result.
> 
> 
>  So it sounds like you need a *unique* id, but nothing you described
>  requies that it be a counter.
> 
>  Take a look at the UUIDField, or consider using the
>  SignatureUpdateProcessor to generate a key based on a hash of all the
>  field values.
> 
>  -Hoss
> 
> >>
> >>
> >>
> >>
> >>
> >>
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>
>
>