SOLR 4 getting stuck during restart

2013-01-19 Thread vijeshnair
I have my index-based spell checker configured, and the select request
handlers are configured with collation (i.e. spellcheck.collate=true).
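
For reference, the relevant pieces of my setup look roughly like this (a
sketch; the field and component names are illustrative):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">direct</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="field">spell</str>
      <!-- one of the threshold parameters I am tuning: ignore terms
           appearing in less than 1% of documents -->
      <float name="thresholdTokenFrequency">.01</float>
    </lst>
  </searchComponent>

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>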

For my testing I have indexed 2 million records and thereafter generated the
index-based dictionary (I am evaluating the DirectSpellChecker; I am seeing
higher memory consumption when I use DirectSpellChecker). Now I wanted to
modify some threshold parameters in the config file for the new spellchecker
component. But when I restarted my Tomcat, the restart got stuck at the
following line:

INFO: QuerySenderListener sending requests to Searcher@332b9f79
main{StandardDirectoryReader(segments_1f:281 _2x(4.0.0.2):C1653773)}

Any comments? Am I missing something, or is there some misconfiguration?
Please help.

My temporary workaround: I removed the index-based dictionary that was
created earlier and restarted. I will regenerate the dictionary now.





Re: Sorting the search results based on number of highlights

2013-01-19 Thread Naresh
Can you set omitNorms to true in the fieldType and see if that helps? Check
http://wiki.apache.org/solr/SchemaXml#Common_field_options
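
For example, something along these lines in schema.xml (the field name and
type are illustrative), keeping in mind that a change to omitNorms requires
re-indexing:

  <field name="fieldA" type="text_general" indexed="true" stored="true" omitNorms="true"/>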

On Fri, Jan 18, 2013 at 4:37 AM, wwhite1133  wrote:

> Sorry, I did not understand what you mean. Can you please elaborate?
> Thanks
> WW



-- 
Regards
Naresh


Re: n values in one fieldType

2013-01-19 Thread blopez
I'll always query on the set of 6 values, but in some cases the matching
doesn't need to be exact.

I mean, a typical query (the 6 integer values) could be an exact match
for the first 4 values, but then a range for the other 2 values.

What do you think would be the best way to approach it?





Re: n values in one fieldType

2013-01-19 Thread Upayavira
Could you not just have six different fields? 

If you wanted greater search efficiency, maybe you could try indexing
them as described above, as strings. You could use the 'shingles' idea:
for example, if you had 11 22 33 44 55 66 as your numbers, index the
terms:

11
11-22
11-22-33
11-22-33-44
11-22-33-44-55
11-22-33-44-55-66

as six values of a multivalued field. That way you can do an exact match
on the first four values. If you then had the individual fields indexed
as number-1 through number-6, you could do a search for 11-22-33-44 plus a
range query on number-5 and another on number-6.
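
A query combining both might then look roughly like this (purely a sketch;
'shingles' is an illustrative name for the multivalued field):

  q=shingles:11-22-33-44 AND number-5:[50 TO 59] AND number-6:[60 TO 69]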

I'm not sure whether that would be more or less performant than just
having six numeric fields, though.

Upayavira


On Sat, Jan 19, 2013, at 12:44 PM, blopez wrote:
> I'll always query on the set of 6 values, but in some cases the matching
> doesn't need to be exact.
> 
> I mean, a typical query (the 6 integer values) could be an exact match
> for the first 4 values, but then a range for the other 2 values.
> 
> What do you think would be the best way to approach it?


Re: Sorting the search results based on number of highlights

2013-01-19 Thread Upayavira
Are you asking to sort on the number of matching terms?

Not an answer, but hopefully giving you pointers...

Firstly, highlighting is done in the highlighting component, which comes
after the QueryComponent in the list of search components invoked.
Sorting happens within the QueryComponent, which therefore cannot make
use of information not yet derived, which makes your request challenging
right from the start.

A sensible way to implement sort here is to implement a function query
(which I think you can do using a ValueSourceParser in 4.0). You can
then sort on that function. So, the question is how can you derive a
numerical value from a document that will represent your sort
preference, before highlighting has actually happened, and how can you
do so quickly enough, given that the value will have to be calculated
for every document matching your query, not just for the ones returned
to your user.
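
As a very rough approximation of that idea (a sketch, and not a true
highlight count, since termfreq() counts raw term occurrences in a field
rather than highlight fragments), Solr 4.0 can already sort on a built-in
function:

  sort=sum(termfreq(fieldA,'foo'),termfreq(fieldB,'foo')) desc

where 'foo' stands for the query term; a real solution for phrase or
multi-term queries would still need the custom ValueSourceParser route
described above.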

HTH

Upayavira

On Mon, Jan 7, 2013, at 07:27 AM, wwhite1133 wrote:
> Hi ,
>  I wanted to sort the results of the solr search query on the number of
> highlights generated per document. 
> e.g  
> Doc 1 
> highlights {
> fieldA
> FieldB
> }
> Doc 2
> Highlights{
> field A 
> fieldC
> fieldC
> }
> 
> Now, I understand that score is calculated depending on many factors like
> tf, idf, boost etc. So when I sort on score, a document with 2
> highlights can come ahead of a document with 3 highlights.
> How can I sort based purely on the number of highlights?
> Right now I am doing it in the code, but I do not want to,
> as the pagination then becomes an issue:
> I have to fetch a larger number of records and do in-memory sorting.
> 
> Any help is appreciated.
> Thanks 
> WW


Re: Solr cache considerations

2013-01-19 Thread Isaac Hebsh
Ok. Thank you everyone for your helpful answers.
I understand that fieldValueCache is not used for resolving queries.
Is there any cache that can help with this basic scenario (a lot of
different queries over a small set of fields)?
Does Lucene's FieldCache help (implicitly)?
How can I use RAM to reduce I/O for this type of query?


On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe <
tomasflo...@gmail.com> wrote:

> No, the fieldValueCache is not used for resolving queries. Only for
> multi-token faceting and apparently for the stats component too. The
> document cache maintains in memory the stored content of the fields you are
> retrieving or highlighting on. It'll hit if the same document matches the
> query multiple times and the same fields are requested, but as Erick said,
> it is important for cases when multiple components in the same request need
> to access the same data.
>
> I think soft committing every 10 minutes is totally fine, but you should
> hard commit more often if you are going to be using the transaction log.
> openSearcher=false will essentially tell Solr not to open a new searcher
> after the (hard) commit, so you won't see the new indexed data and caches
> won't be flushed. openSearcher=false makes sense when you are using
> hard commits together with soft commits: as the soft commit is dealing
> with opening/closing searchers, you don't need hard commits to do it.
>
> Tomás
>
>
> On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh 
> wrote:
>
> > Unfortunately, it seems (
> > http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that
> > these caches are not per-segment. In this case, I want to (soft) commit
> > less frequently. Am I right?
> >
> > Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I
> > guess it contributes a lot to standard (not only faceted) query time.
> > The SolrWiki claims that it is primarily used by faceting. What does
> > that say about complex textual queries?
> >
> > documentCache:
> > Erick, after query processing is finished, don't some documents stay in
> > the documentCache? Can't I use it to accelerate queries that should
> > retrieve stored fields of documents? In this case, a big documentCache
> > can hold more documents.
> >
> > About commit frequency:
> > HardCommit: "openSearcher=false" seems like a nice solution. Where can I
> > read about this? (I found nothing but one unexplained sentence in the
> > SolrWiki.)
> > SoftCommit: in my case, the required index freshness is 10 minutes. The
> > plan to soft commit every 10 minutes is similar to storing all of the
> > documents in a queue (outside of Solr), and indexing a bulk every 10
> > minutes.
> >
> > Thanks.
> >
> >
> > On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe <
> > tomasflo...@gmail.com> wrote:
> >
> > > I think fieldValueCache is not per segment, only fieldCache is.
> However,
> > > unless I'm missing something, this cache is only used for faceting on
> > > multivalued fields
> > >
> > >
> > > On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson <
> erickerick...@gmail.com
> > > >wrote:
> > >
> > > > filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in
> > > > cache). Notice the /8. This reflects the fact that the filters are
> > > > represented by a bitset on the _internal_ Lucene ID. UniqueId has no
> > > > bearing here whatsoever. This is, in a nutshell, why warming is
> > > > required: the internal Lucene IDs may change. Note also that it's
> > > > maxDoc; the internal arrays have "holes" for deleted documents.
> > > >
> > > > Note this is an _upper_ bound; if there are only a few docs that
> > > > match, the size will be (num of matching docs) * sizeof(int).
> > > >
> > > > fieldValueCache. I don't think so, although I'm a bit fuzzy on this.
> > > > It depends on whether these are "per-segment" caches or not. Any "per
> > > > segment" cache is still valid.
> > > >
> > > > Think of documentCache as intended to hold the stored fields while
> > > > various components operate on it, thus avoiding repeatedly fetching
> > > > the data from disk. It's _usually_ not too big a worry.
> > > >
> > > > About hard-commits once a day. That's _extremely_ long. Think instead
> > > > of committing more frequently with openSearcher=false. If nothing
> > > > else, your transaction log will grow lots and lots and lots. I'm
> > > > thinking on the order of 15 minutes, or possibly even much less. With
> > > > softCommits happening more often, maybe every 15 seconds. In fact,
> I'd
> > > > start out with soft commits every 15 seconds and hard commits
> > > > (openSearcher=false) every 5 minutes. The problem with hard commits
> > > > being once a day is that, if for any reason the server is
> interrupted,
> > > > on startup Solr will try to replay the entire transaction log to
> > > > assure index integrity. Not to mention that your tlog will be huge.
> > > > Not to mention that there is some memory usage for each document in
> > > > the tlog. Hard commits roll over the tlog.
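
In solrconfig.xml, the commit policy Erick suggests corresponds roughly to
the following sketch (intervals per his numbers above):

  <autoCommit>
    <maxTime>300000</maxTime>          <!-- hard commit every 5 minutes -->
    <openSearcher>false</openSearcher> <!-- don't open a new searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>15000</maxTime>           <!-- soft commit every 15 seconds -->
  </autoSoftCommit>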

Have the SolrCloud collection REST endpoints moved or changed for 4.1?

2013-01-19 Thread Brett Hoerner
I was using Solr 4.0 but ran into a few problems using SolrCloud. I'm
trying out 4.1 RC1 right now but the update URL I used to use is returning
HTTP 404.

For example, I would post my document updates to,

http://localhost:8983/solr/collection1

But that is 404ing now (collection1 exists according to the admin UI, all
shards are green and happy, and data dirs exist on the nodes).

I also tried the following,

http://localhost:8983/solr/collection1/update

And also received a 404 there.

A specific example from the Java client:

22:38:12.474 [pool-7-thread-14] ERROR com.massrel.faassolr.SolrBackend -
Error while flushing to Solr.
org.apache.solr.common.SolrException: Server at
http://backfill-2d.i.massrel.com:8983/solr/15724/update returned non ok
status:404, message:Not Found
 at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
 at
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:438)
~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]

But I can hit that URL with a GET,

$ curl http://backfill-1d.i.massrel.com:8983/solr/15724/update


<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst>
<lst name="error"><str name="msg">missing content stream</str><int name="code">400</int></lst>
</response>


Thoughts?

Thanks.


Re: Have the SolrCloud collection REST endpoints moved or changed for 4.1?

2013-01-19 Thread Brett Hoerner
I'm actually wondering if this other issue I've been having is related:

https://issues.apache.org/jira/browse/SOLR-4321

The fact that some nodes don't "get" pieces of a collection could explain
the 404.

That said, even when a node has "parts" of a collection, it reports 404
sometimes. What's odd is that I can use curl to post a JSON document to the
same URL and it will return 200.
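
For example, something along these lines gets a 200 (a sketch; the host,
core name and document are illustrative):

  $ curl 'http://localhost:8983/solr/15724/update?commit=true' \
      -H 'Content-Type: application/json' \
      -d '[{"id":"doc1"}]'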

When I log every request I make from my indexer process (using SolrJ), it's
about 50/50 between 404 and 200...


On Sat, Jan 19, 2013 at 5:22 PM, Brett Hoerner wrote:

> I was using Solr 4.0 but ran into a few problems using SolrCloud. I'm
> trying out 4.1 RC1 right now but the update URL I used to use is returning
> HTTP 404.
>
> For example, I would post my document updates to,
>
> http://localhost:8983/solr/collection1
>
> But that is 404ing now (collection1 exists according to the admin UI, all
> shards are green and happy, and data dirs exist on the nodes).
>
> I also tried the following,
>
> http://localhost:8983/solr/collection1/update
>
> And also received a 404 there.
>
> A specific example from the Java client:
>
> 22:38:12.474 [pool-7-thread-14] ERROR com.massrel.faassolr.SolrBackend -
> Error while flushing to Solr.
> org.apache.solr.common.SolrException: Server at
> http://backfill-2d.i.massrel.com:8983/solr/15724/update returned non ok
> status:404, message:Not Found
>  at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
>  at
> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:438)
> ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
> at
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
>
> But I can hit that URL with a GET,
>
> $ curl http://backfill-1d.i.massrel.com:8983/solr/15724/update
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst>
> <lst name="error"><str name="msg">missing content stream</str><int name="code">400</int></lst>
> </response>
>
> Thoughts?
>
> Thanks.
>


Re: XPath with ExtractingRequestHandler

2013-01-19 Thread Arcadius Ahouansou
Hi Mike.

I am going through this too.

How did you solve this?

Thanks.

Arcadius.


On 15 December 2011 12:49, Michael Kelleher  wrote:

> Yeah, I tried:
>
>
> //xhtml:div[@class='bibliographicData']/descendant::node()
>
> also tried
>
> //xhtml:div[@class='bibliographicData']
>
> Neither worked.  The DIV I need also had an ID value, and I tried both
> variations on ID as well.  Still nothing.
>
>
> XPath handling in Tika seems to be pretty basic and does not seem to
> support most XPath query syntax, probably because it's using a SAX parser;
> I don't know. I guess I will have to write something custom to get it to
> do what I need.
>
> Thanks for the reply though.
>
> I will post a follow up with how I fixed this.
>
> --mike
>


Re: updateLog in Solr 4.1

2013-01-19 Thread Mark Miller
Indexing should definitely not slow down substantially if you commit every
minute or so. Be sure to use openSearcher=false on the auto hard commit.
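
For reference, the stock solrconfig.xml enables the update log inside
<updateHandler>, and the auto hard commit sits next to it; a sketch using
the one-minute interval mentioned above:

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit every minute -->
    <openSearcher>false</openSearcher> <!-- don't open a new searcher -->
  </autoCommit>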

- Mark

On Jan 19, 2013, at 11:11 PM, Nikhil Chhaochharia  wrote:

> 
> 
> Hi,
> 
> We run a SolrCloud cluster using Solr 4.0; updateLog is not configured since
> replication etc. is not required.
> 
> 
> When we switched to Solr 4.1 RC1, we got the following exception and Solr
> failed to start. It looks like an updateLog is essential in Solr 4.1. The
> description for updateLog in solrconfig.xml says:
> "The log can grow as big as uncommitted changes to the index, so use of a
> hard autoCommit is recommended"
> 
> We index ~1TB of data periodically (offline, no searches during indexing,
> starting with an empty index) and commit only after all the documents have
> been added. If we enable the updateLog, the log will grow too large, and if
> we commit periodically, the indexing slows down substantially. Is there a
> way to use Solr 4.1 with the updateLog disabled? Any other suggestions?
> 
> 
> Thanks,
> Nikhil
> 
> 
> 
> Jan 19, 2013 10:06:22 AM org.apache.solr.cloud.SyncStrategy sync
> SEVERE: No UpdateLog found - cannot sync
> Jan 19, 2013 10:06:22 AM org.apache.solr.common.SolrException log
> SEVERE: :java.lang.NullPointerException
> at 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:187)
> at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:156)
> at 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:100)
> at 
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:266)
> at 
> org.apache.solr.cloud.ZkController.joinElection(ZkController.java:842)
> at org.apache.solr.cloud.ZkController.register(ZkController.java:668)
> at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
> at 
> org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
> at 
> org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
>
> at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
> at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
> at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



Re: Using Solr Spatial in conjunction with HBASE/Hadoop

2013-01-19 Thread David Smiley (@MITRE.org)
oakstream wrote
> Thanks guys!
> David,
> 
> In general, and in your opinion, would Lucene spatial be the way to go to
> index hundreds of terabytes of spatial data that continually grows?
> Mostly point data, mostly structured, though there could be polygons.  The
> searches would be "within" or "contains" on a polygon.

That's a lot of data!  I don't know what the upper bound is on how much data
a sharded Lucene-based system can handle for spatial searches given
"response times in seconds".  As with most things, people should try it
themselves because there are so many variables.

I suggest separating indexed point data from indexed polygon data, so you
can optimize for both.  For example, a point cannot satisfy the "contains"
predicate, so skip that data set.  And for the "within" predicate, querying
indexed points is equivalent to using the "intersects" predicate, which is
quite fast and will always be the fastest predicate, I believe.  Polygon
"within" polygon (or any non-point shape "within" any other non-point shape)
is something I'm currently working on; give me a couple weeks: LUCENE-4644.
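
For point data, then, a within-a-polygon search can be issued as an
intersects filter, along these lines (a sketch; the field name is
illustrative, and polygon shapes require JTS on the classpath):

  fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, -10 30)))"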

A shortcoming of indexing non-point shapes to be aware of is that the shape
is completely represented by the gridded index.  If you want an indexed
polygon to be represented fairly precisely, then it may take an unreasonable
number of indexed grid cells to represent that shape.  Eventually, I want to
store the shape exactly (in Lucene's DocValues structure) in addition to a
coarse grid index so that it can be consulted for the particular cases where
the index alone isn't sure whether a document's shape satisfies the predicate.


oakstream wrote
> Do you have any thoughts on using a NoSQL database (like MongoDB) or
> something else comparable?  I need response times in the seconds.  My
> thoughts are that I need some type of distributed system.  I was thinking
> about SolrCloud to solve this.  I'm fairly new to Lucene/Solr.  Most of
> the data is currently in HDFS/HBase.
> 
> I've investigated sharding Oracle and Postgres databases, but this just
> doesn't seem like the ideal solution, and since all the data already exists
> in HDFS, I'd like to build a solution that works on top of it, but
> "real-time" or as "near" as I can get.
> 
> Anyways, I've read some of your work in the past and appreciate your
> input.  
> I don't mind putting in some development work, just not sure the right
> approach. 
> 
> Thanks for your time. I appreciate it!

Hundreds of terabytes of data precludes relational databases.  I'm not sure
how MongoDB would fare.  And presumably HBase doesn't have spatial support
or you wouldn't be looking elsewhere (here).  Based on other conversations
I'm having about how spatial searches could work in Accumulo, I don't think
it will be able to match Lucene's performance.  In Lucene I'm able to edge
n-gram geohashes (e.g. Boston's geohash: DRT2Y, DRT2, DRT, DR, D), and the
intersection algorithm is able to do wonders with this.  But with Accumulo
(or HBase or Cassandra) I believe I'd have to stick with the full-length
geohash as the sorted key, which means query shapes covering lots of indexed
data will take longer as they have to fully iterate over underlying data. 
If anyone reading what I'm saying here is at least somewhat familiar with
why Lucene's trie-based numeric fields are so fast compared to how things
worked before, it's for the same reason that Lucene 4 PrefixTree (a synonym
of Trie) based fields are fast.  I don't think the underlying approach is
even possible in the big-table NoSQL based systems without some kind of
inverted index.  If anyone knows otherwise then please enlighten me!
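
To make the edge n-gram idea concrete, the indexing side amounts to no more
than this (an illustrative sketch in plain Lucene, not the actual PrefixTree
code):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;

  // Index every prefix of a point's geohash as a separate term, so that
  // query shapes can match coarse grid cells without scanning all points.
  Document doc = new Document();
  String geohash = "drt2y";  // e.g. Boston
  for (int len = 1; len <= geohash.length(); len++) {
      doc.add(new StringField("geo", geohash.substring(0, len), Field.Store.NO));
  }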

~ David Smiley



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book