Re: Distributing Collections across Shards

2016-03-30 Thread Salman Ansari
Thanks Erick for the help. Appreciate it.

Regards,
Salman

On Wed, Mar 30, 2016 at 7:29 AM, Erick Erickson 
wrote:

> Absolutely. You haven't said which version of Solr you're using,
> but there are several possibilities:
> 1> create the collection with replicationFactor=1, then use the
> ADDREPLICA command to specify exactly what node the  replicas
> for each shard are created on with the 'node' parameter.
> 2> For recent versions of Solr, you can create a collection with _no_
> replicas and then ADDREPLICA as you choose.
>
> Best,
> Erick
>
> On Tue, Mar 29, 2016 at 5:10 AM, Salman Ansari 
> wrote:
> > Hi,
> >
> > I believe the default behavior of creating collections distributed across
> > shards through the following command
> >
> > http://[solrlocation]:8983/solr/admin/collections?action=CREATE&name=[collection_name]&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=[configuration_name]
> >
> > is that Solr will create the collection as follows
> >
> > *shard1: *leader in server1 and replica in server2
> > *shard2:* leader in server2 and replica in server1
> >
> > However, I have seen cases when running the above command that it creates
> > both the leader and replica on the same server.
> >
> > Wondering if there is a way to control this behavior (I mean control
> where
> > the leader and the replica of each shard will reside)?
> >
> > Regards,
> > Salman
>
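
For illustration, the two-step approach Erick describes might look roughly like
this (the bracketed placeholders are from the thread; the node names are
assumptions and must match the node_name values your cluster reports):

# 1) create the collection with a single replica per shard
curl "http://[solrlocation]:8983/solr/admin/collections?action=CREATE&name=[collection_name]&numShards=2&replicationFactor=1&maxShardsPerNode=2&collection.configName=[configuration_name]"

# 2) place each additional replica explicitly with the 'node' parameter
curl "http://[solrlocation]:8983/solr/admin/collections?action=ADDREPLICA&collection=[collection_name]&shard=shard1&node=server2:8983_solr"
curl "http://[solrlocation]:8983/solr/admin/collections?action=ADDREPLICA&collection=[collection_name]&shard=shard2&node=server1:8983_solr"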


Re: High Cpu sys usage

2016-03-30 Thread YouPeng Yang
Hi
  Thank you, Erick.
   The main collection that stores our trade data is set to soft commit
when we import data using DIH. As you guessed, the softcommit interval
is "<maxTime>1000</maxTime>" and we have autowarm counts set to 0. However,
there are some collections that store our metadata in which we commit after
each add, and these metadata collections hold just a few docs.


Best Regards


2016-03-30 12:25 GMT+08:00 Erick Erickson :

> Do not, repeat NOT try to "cure" the "Overlapping onDeckSearchers"
> by bumping this limit! What that means is that your commits
> (either hard commit with openSearcher=true or softCommit) are
> happening far too frequently and your Solr instance is trying to do
> all sorts of work that is immediately thrown away and chewing up
> lots of CPU. Perhaps this will help:
>
>
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> I'd guess that you're
>
> > committing every second, or perhaps your indexing client is committing
> after each add. If the latter, do not do this and rely on the
> autocommit settings,
> and if the former make those intervals as long as you can stand.
>
> > you may have your autowarm counts in your solrconfig.xml file set at
> very high numbers (let's see the filterCache settings, the queryResultCache
> settings etc.).
>
> I'd _strongly_ recommend that you put the on deck searchers back to
> 2 and figure out why you have so many overlapping searchers.
>
> Best,
> Erick
>
> On Tue, Mar 29, 2016 at 8:57 PM, YouPeng Yang 
> wrote:
> > Hi Toke
> >   The number of collection is just 10.One of collection has 43
> shards,each
> > shard has two replicas.We continue  importing data from oracle all the
> time
> > while our systems provide searching service.
> >There are "Overlapping onDeckSearchers" in my solr.logs. What is the
> > meaning of "Overlapping onDeckSearchers"? We set
> > <maxWarmingSearchers>20</maxWarmingSearchers> and <useColdSearcher>true</useColdSearcher>. Is it right?
> >
> >
> >
> > Best Regard.
> >
> >
> > 2016-03-29 22:31 GMT+08:00 Toke Eskildsen :
> >
> >> On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
> >> >   Our system still goes down as times going.We found lots of threads
> are
> >> > WAITING.Here is the threaddump that I copy from the web page.And 4
> >> pictures
> >> > for it.
> >> >   Is there any relationship with my problem?
> >>
> >> That is a lot of commitScheduler-threads. Do you have hundreds of
> >> collections in your cloud?
> >>
> >>
> >> Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see
> >> if you got caught in a downwards spiral of concurrent commits.
> >>
> >> - Toke Eskildsen, State and University Library, Denmark
> >>
> >>
> >>
>


Re: How to implement Autosuggestion

2016-03-30 Thread Alessandro Benedetti
Hi Mugeesh, the autocompletion world is not as simple as you might expect.
Which kind of auto suggestion are you interested in?

First of all, simple string autosuggestion or document autosuggestion (with
additional fields to show besides the label)?
Are you interested in analysis of the text to suggest? Fuzzy suggestions?
Exact "beginning of the phrase" suggestions? Infix suggestions?
Try to give some examples and we can help better.
There is a specific suggester component, so it is likely to be useful to
you, but let's try to discover more.
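
As a starting point, a minimal AnalyzingInfix suggester configuration in
solrconfig.xml could look like the sketch below (the component/handler names
and the "title" field are assumptions, not from this thread):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>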

Cheers

On Mon, Mar 28, 2016 at 6:03 PM, Reth RM  wrote:

> Solr AnalyzingInfix suggester component:
> https://lucidworks.com/blog/2015/03/04/solr-suggester/
>
>
>
> On Mon, Mar 28, 2016 at 7:57 PM, Mugeesh Husain  wrote:
>
> > Hi,
> >
> > I am looking for the best way to implement autosuggestion in ecommerce
> > using solr or elasticsearch.
> >
> > I guess using ngram analyzer is not a good way if data is big.
> >
> >
> > Please suggest me any link or your opinion ?
> >
> >
> >
> > Thanks
> > Mugeesh
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/How-to-implement-Autosuggestion-tp4266434.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Solr not working on new environment

2016-03-30 Thread Jarus Bosman
OK, an update. I managed to remove the example/cloud directories, and stop
Solr. I changed my startup script to be much simpler (./solr start) and now
I get this:

[root@ bin]# ./startsolr.sh
Waiting up to 30 seconds to see Solr running on port 8983 [|]
Started Solr server on port 8983 (pid=31937). Happy searching!

[root@nationalarchives bin]# ./solr status

Found 1 Solr nodes:

Solr process 31937 running on port 8983
{
  "solr_home":"/opt/solr-5.5.0/server/solr",
  "version":"5.5.0 2a228b3920a07f930f7afb6a42d0d20e184a943c - mike - 2016-02-16 15:22:52",
  "startTime":"2016-03-30T09:24:21.445Z",
  "uptime":"0 days, 0 hours, 3 minutes, 9 seconds",
  "memory":"62 MB (%12.6) of 490.7 MB"}

I now want to connect to it from my Drupal installation, but I'm getting
this: "The Solr server could not be reached. Further data is therefore
unavailable." - I realise this is probably not a Solr error, just giving
all the information I have. When I try to connect to
:8983/solr, I get a timeout. Does it sound like firewall issues?
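
A couple of quick checks that can narrow this down (hostnames here are
placeholders, not from this thread):

# from the Drupal host: can Solr be reached at all?
curl -m 5 "http://solr-host:8983/solr/admin/info/system?wt=json"

# on the Solr host: is the port listening, and is anything filtering it?
netstat -tlnp | grep 8983
iptables -L -n | grep 8983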

Regards,
Jarus

"Getting information off the Internet is like taking a drink from a fire
hydrant." - Mitchell Kapor

 .---.  .-.   .-..-.   .-.,'|"\.---.,--,
/ .-. )  ) \_/ /  \ \_/ )/| |\ \  / .-. ) .' .'
| | |(_)(_)   /\   (_)| | \ \ | | |(_)|  |  __
| | | |   / _ \ ) (   | |  \ \| | | | \  \ ( _)
\ `-' /  / / ) \| |   /(|`-' /\ `-' /  \  `-) )
 )---'  `-' (_)-'  /(_|  (__)`--'  )---'   )\/
(_)   (__)(_) (__)

On Wed, Mar 30, 2016 at 8:50 AM, Jarus Bosman  wrote:

> Hi Erick,
>
> Thanks for the reply. It seems I have not done all my homework yet.
>
> We used to use Solr 3.6.2 on the old environment (we're using it in
> conjunction with Drupal). When I got connectivity problems on the new
> server, I decided to rather implement the latest version of Solr (5.5.0). I
> read the Quick Start documentation and expected it to work first time, but
> not so (as per my previous email). I will read up a bit on ZooKeeper (never
> heard of it before - What is it?). Is there a good place to read up on
> getting started with ZooKeeper and the latest versions of Solr (apart from
> what you have replied, of course)?
>
> Thank you so much for your assistance,
> Jarus
>
>
> "Getting information off the Internet is like taking a drink from a fire
> hydrant." - Mitchell Kapor
>
>  .---.  .-.   .-..-.   .-.,'|"\.---.,--,
> / .-. )  ) \_/ /  \ \_/ )/| |\ \  / .-. ) .' .'
> | | |(_)(_)   /\   (_)| | \ \ | | |(_)|  |  __
> | | | |   / _ \ ) (   | |  \ \| | | | \  \ ( _)
> \ `-' /  / / ) \| |   /(|`-' /\ `-' /  \  `-) )
>  )---'  `-' (_)-'  /(_|  (__)`--'  )---'   )\/
> (_)   (__)(_) (__)
>
> On Wed, Mar 30, 2016 at 6:20 AM, Erick Erickson 
> wrote:
>
>> Good to meet you!
>>
>> It looks like you've tried to start Solr a time or two. When you start
>> up the "cloud" example
>> it creates
>> /opt/solr-5.5.0/example/cloud
>> and puts your SolrCloud stuff under there. It also automatically puts
>> your configuration
>> sets up on Zookeeper. When I get this kind of thing, I usually
>>
>> > stop Zookeeper (if running externally)
>>
>> > rm -rf /opt/solr-5.5.0/example/cloud
>>
>> > delete all the Zookeeper data. It may take a bit of poking to find out
>> where
>> the Zookeeper data is. It's usually in /tmp/zookeeper if you're running ZK
>> standalone, or in a subdirectory in Solr if you're using embedded ZK.
>> NOTE: if you're running standalone zookeeper, you should _definitely_
>> change the data dir because it may disappear from /tmp/zookeeper. One
>> of Zookeeper's little quirks.
>>
>> > try it all over again.
>>
>> Here's the problem. The example (-e cloud) tries to do a bunch of stuff
>> for
>> you to get the installation up and running without having to wend your way
>> through all of the individual commands. Sometimes getting partway through
>> leaves you in an ambiguous state. Or at least a state you don't quite know
>> what all the moving parts are.
>>
>> Here's the steps you need to follow if you're doing them yourself rather
>> than
>> relying on the canned example
>> 1> start Zookeeper externally. For experimentation, a single ZK is quite
>> sufficient, I don't bother with 3 ZK instances and a quorum unless I'm
>> in a production situation.
>> 2> start solr with the bin/solr script, use the -c and -z options. At
>> this point,
>> you have a functioning Solr, but no collections. You should be
>> able to see the solr admin UI at http://node:8982/solr at this point.
>> 3> use the bin/solr zk -upconfig command to put a configset in ZK
>> 4> use the Collections API to create and maintain collections.
>>
>> And one more note. When you use the '-e cloud' option, you'll see
>> messages go by about starting nodes with a command like:
>>
>> bin/solr start -c -z localhost:2181 -p 8981 -s example/cloud/node1/solr
>> bin/solr start -c -z localhost:2181 -p 
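
For reference, a sketch of Erick's four steps as concrete commands (config name,
collection name, paths and ports are assumptions; double-check the zk flags with
bin/solr zk -help on your version):

bin/solr start -c -z localhost:2181 -p 8983
bin/solr zk -upconfig -z localhost:2181 -n myconfig -d server/solr/configsets/basic_configs/conf
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=1&collection.configName=myconfig"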

RE: Deleted documents and expungeDeletes

2016-03-30 Thread Markus Jelsma
Hello - with TieredMergePolicy and default reclaimDeletesWeight of 2.0, and 
frequent updates, it is not uncommon to see a ratio of 25%. If you want deletes 
to be reclaimed more often, e.g. weight of 4.0, you will see very frequent 
merging of large segments, killing performance if you are on spinning disks.
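
For reference, a sketch of raising the weight inside <indexConfig> in
solrconfig.xml (4.0 is just the illustrative figure from the paragraph above):

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <double name="reclaimDeletesWeight">4.0</double>
</mergePolicy>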

Markus

 
 
-Original message-
> From:Erick Erickson 
> Sent: Wednesday 30th March 2016 2:50
> To: solr-user 
> Subject: Re: Deleted documents and expungeDeletes
> 
> bq: where I see that the number of deleted documents just
> keeps on growing and growing, but they never seem to be deleted
> 
> This shouldn't be happening.  The default TieredMergePolicy weights
> segments to be merged (which happens automatically) heavily as per
> the percentage of deleted docs. Here's a great visualization:
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
> 
> It may be that when you say "growing and growing", that the number of
> deleted docs hasn't reached the threshold where they get merged away.
> 
> Please specify what "growing and growing" means. I wouldn't start to worry until it
> gets to 15% or more of the total, and then only if it kept growing after that.
> 
> To your questions:
> 1> This is automatic. It'll "just happen", but you will probably always carry
> some deleted docs around in your index.
> 
> 2> You always need at least as much free space as your index occupies on disk.
> In the worst case of normal merging, _all_ the segments will be merged
> and they're
> copied first. Once that's successful, then the original is deleted.
> 
> 3> Not really. Normally there should be no need.
> 
> 4> True, but usually the effect is so minuscule that nobody notices.
> People spend
> endless time obsessing about this and unless and until you can show that your
> _users_ notice, I'd ignore it.
> 
> Best,
> Erick
> 
> On Tue, Mar 29, 2016 at 8:16 AM, Jostein Elvaker Haande
>  wrote:
> > Hello everyone,
> >
> > I apologise beforehand if this is a question that has been visited
> > numerous times on this list, but after hours spent on Google and
> > talking to SOLR savvy people on #solr @ Freenode I'm still a bit at a
> > loss about SOLR and deleted documents.
> >
> > I have quite a few indexes in both production and development
> > environments, where I see that the number of deleted documents just
> > keeps on growing and growing, but they never seem to be deleted. From
> > my understanding, this can be controller in the merge policy set for
> > the current core, but I've not been able to find any specifics on the
> > topic.
> >
> > The general consensus on most search hits I've found is to perform an
> > optimize of the core, however this is both an expensive operation,
> > both in terms of CPU cycles as well as disk I/O, and also requires you
> > to have anywhere from 2 times to 3 times the size of the index
> > available on disk to be guaranteed to complete fully. Given these
> > criteria, it's often not something that is a viable option in certain
> > environments, both to it being a resource hog and often that you just
> > don't have the needed available disk space to perform the optimize.
> >
> > After having spoken with a couple of people on IRC (thanks tokee and
> > elyograg), I was made aware of an optional parameter for <commit/>
> > called 'expungeDeletes' that can explicitly make sure that deleted
> > documents are deleted from the index, i.e:
> >
> > curl http://localhost:8983/solr/coreName/update -H "Content-Type:
> > text/xml" --data-binary '<commit expungeDeletes="true"/>'
> >
> > Now my questions are as follows:
> >
> > 1) How can I make sure that this is dealt with in my merge policy, if
> > at all possible?
> > 2) I've tried to find some disk space guidelines for 'expungeDeletes',
> > however I've not been able to find any. What are the general
> > guidelines here? Does it require as much space as an optimize, or is
> > it less "aggressive" compared to an optimize?
> > 3) Is 'expungeDeletes' the recommended method to make sure your
> > deleted documents are actually removed from the index, or should you
> > deal with this in your merge policy?
> > 4) I have also heard from talks on #SOLR that deleted documents have an
> > impact on the relevancy of performed searches. Is this correct, or
> > just misinformation?
> >
> > If you require any additional information, like snippets from my
> > configuration (solrconfig.xml), I'm more than happy to provide this.
> >
> > Again, if this is an issue that's being revisited for the Nth time, I
> > apologize, I'm just trying to get my head around this with my somewhat
> > limited SOLR knowledge.
> >
> > --
> > Yours sincerely Jostein Elvaker Haande
> > "A free society is a society where it is safe to be unpopular"
> > - Adlai Stevenson
> >
> > http://tolecnal.net -- tolecnal at tolecnal dot net
> 


[Possible Bug] 5.5.0 Startup script ignoring host parameter?

2016-03-30 Thread Bram Van Dam
Hi folks,

It looks like the "-h" parameter isn't being processed correctly. I want
Solr to listen on 127.0.0.1, but instead it binds to all interfaces. Am
I doing something wrong? Or am I misinterpreting what the -h parameter
is for?

Linux:

# bin/solr start -h 127.0.0.1 -p 8180
# netstat -tlnp | grep 8180
tcp6   0  0 :::8180 :::*
LISTEN  14215/java

Windows:

> solr.cmd start -h 127.0.0.1 -p 8180
> netstat -a
TCP    0.0.0.0:8180    MyBox:0    LISTENING


The Solr JVM args are likely the cause. From the Solr Admin GUI:
-DSTOP.KEY=solrrocks
-Dhost=127.0.0.1
-Djetty.port=8180

Presumably that ought to be -Djetty.host=127.0.0.1 instead of -Dhost?

This has potential security implications for us :-(

Thanks,

 - Bram


Re: Deleted documents and expungeDeletes

2016-03-30 Thread Jostein Elvaker Haande
On 30 March 2016 at 02:49, Erick Erickson  wrote:
> Please specify what "growing and growing" means. I wouldn't start to worry until it
> gets to 15% or more of the total, and then only if it kept growing after that.

I tested 'expungeDeletes' on four different cores, three of them were
nearly identical in terms of numbers. Max Docs were around ~2.2M, Num
Docs was ~1.6M and Deleted Docs were ~600K - so the percentage of
Deleted Docs were around the ~27 percent mark. So according to your
feedback, I should start to worry! Now the question is, why aren't the
Deleted Docs being merged away if this is in fact supposed to happen?

> 1> This is automatic. It'll "just happen", but you will probably always carry
> some deleted docs around in your index.

Yeah, that I am aware of - I noticed that even after running
'expungeDeletes' I had a few thousand docs left, which is acceptable
and does not worry me.

> 4> True, but usually the effect is so minuscule that nobody notices.
> People spend
> endless time obsessing about this and unless and until you can show that your
> _users_ notice, I'd ignore it.

Hehe, then I'll refrain from being one of those that obsess over this.
As long as I know the effect it has is minuscule, then I'll just toss
the thought in the bin.

-- 
Yours sincerely Jostein Elvaker Haande
"A free society is a society where it is safe to be unpopular"
- Adlai Stevenson

http://tolecnal.net -- tolecnal at tolecnal dot net


Re: Deleted documents and expungeDeletes

2016-03-30 Thread Jostein Elvaker Haande
On 30 March 2016 at 12:25, Markus Jelsma  wrote:
> Hello - with TieredMergePolicy and default reclaimDeletesWeight of 2.0, and 
> frequent updates, it is not uncommon to see a ratio of 25%. If you want 
> deletes to be reclaimed more often, e.g. weight of 4.0, you will see very 
> frequent merging of large segments, killing performance if you are on 
> spinning disks.

Most of our installations are on spinning disks, so if I want a more
aggressive reclaim, this will impact performance. This is of course
something that I do not desire, so I'm wondering if scheduling a
commit with 'expungeDeletes' during off peak business hours is a
better approach than setting up a more aggressive merge policy.

-- 
Yours sincerely Jostein Elvaker Haande
"A free society is a society where it is safe to be unpopular"
- Adlai Stevenson

http://tolecnal.net -- tolecnal at tolecnal dot net


Re: Deleted documents and expungeDeletes

2016-03-30 Thread David Santamauro



On 03/30/2016 08:23 AM, Jostein Elvaker Haande wrote:

On 30 March 2016 at 12:25, Markus Jelsma  wrote:

Hello - with TieredMergePolicy and default reclaimDeletesWeight of 2.0, and 
frequent updates, it is not uncommon to see a ratio of 25%. If you want deletes 
to be reclaimed more often, e.g. weight of 4.0, you will see very frequent 
merging of large segments, killing performance if you are on spinning disks.


Most of our installations are on spinning disks, so if I want a more
aggressive reclaim, this will impact performance. This is of course
something that I do not desire, so I'm wondering if scheduling a
commit with 'expungeDeletes' during off peak business hours is a
better approach than setting up a more aggressive merge policy.



As far as my experimentation with @expungeDeletes goes, if the data you 
indexed and committed using @expungeDeletes didn't touch segments with 
any deleted documents, or wasn't enough data to cause merging with a 
segment containing deleted documents, then no deleted documents will be 
removed. Basically, @expungeDeletes expunges deletes in segments 
affected by the commit. If you have a large update that touches many 
segments containing deleted documents and you use @expungeDeletes, it 
could be just as resource intensive as an optimize.


My setting for reclaimDeletesWeight:
  <double name="reclaimDeletesWeight">5.0</double>

It keeps the deleted documents down to ~10% without any noticeable 
impact on resources or performance. But I'm still in the testing phase 
with this setting.




Re: Regarding JSON indexing in SOLR 4.10

2016-03-30 Thread Paul Hoffman
On Tue, Mar 29, 2016 at 11:30:06PM -0700, Aditya Desai wrote:
> I am running SOLR 4.10 on port 8984 by changing the default port in
> etc/jetty.xml. I am now trying to index all my JSON files to Solr running
> on 8984. The following is the command
> 
> curl 'http://localhost:8984/solr/update?commit=true' --data-binary *.json
> -H 'Content-type:application/json'

The wildcard is the problem; your shell is expanding --data-binary 
*.json to --data-binary foo.json bar.json baz.json and curl doesn't know 
how to download bar.json and baz.json.

Try this instead:

for file in *.json; do
    # note the "@" -- it tells curl to read the request body from the file
    curl 'http://localhost:8984/solr/update?commit=true' --data-binary "@$file" \
        -H 'Content-type:application/json'
done

Paul.

-- 
Paul Hoffman 
Systems Librarian
Fenway Libraries Online
c/o Wentworth Institute of Technology
550 Huntington Ave.
Boston, MA 02115
(617) 442-2384 (FLO main number)


Re: [Possible Bug] 5.5.0 Startup script ignoring host parameter?

2016-03-30 Thread Shawn Heisey
On 3/30/2016 5:45 AM, Bram Van Dam wrote:
> It looks like the "-h" parameter isn't being processed correctly. I want
> Solr to listen on 127.0.0.1, but instead it binds to all interfaces. Am
> I doing something wrong? Or am I misinterpreting what the -h parameter
> is for?

The host parameter does not control binding to network interfaces.  It
controls what hostname is published to zookeeper when running in cloud mode.

Solr's networking is provided by a third-party application -- the
servlet container.  In 5.x, we took steps to ensure that the container
everyone uses is Jetty.  The default Jetty configuration supplied with
Solr will bind to all interfaces.  If you want to control interface
binding, you need to edit the Jetty config, in server/etc.  The file
most likely to need changes is server/etc/jetty.xml.
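
As a sketch (the exact layout of the shipped file differs between versions), the
change amounts to setting the connector's host inside the existing ServerConnector
definition in server/etc/jetty.xml:

<Set name="host">127.0.0.1</Set>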

The following URL contains Jetty's documentation on how to configure the
networking.  Right now this URL applies to version 9, used in Solr 5.x:

http://www.eclipse.org/jetty/documentation/current/configuring-connectors.html

Thanks,
Shawn



Re: Deleted documents and expungeDeletes

2016-03-30 Thread Erick Erickson
Through a clever bit of reflection, you can set the
reclaimDeletesWeight variable from solrconfig by including something
like
<double name="reclaimDeletesWeight">5</double> (going from memory
here; you'll get an error on startup if I've messed it up.)

That may help..

Best,
Erick


On Wed, Mar 30, 2016 at 6:15 AM, David Santamauro
 wrote:
>
>
> On 03/30/2016 08:23 AM, Jostein Elvaker Haande wrote:
>>
>> On 30 March 2016 at 12:25, Markus Jelsma 
>> wrote:
>>>
>>> Hello - with TieredMergePolicy and default reclaimDeletesWeight of 2.0,
>>> and frequent updates, it is not uncommon to see a ratio of 25%. If you want
>>> deletes to be reclaimed more often, e.g. weight of 4.0, you will see very
>>> frequent merging of large segments, killing performance if you are on
>>> spinning disks.
>>
>>
>> Most of our installations are on spinning disks, so if I want a more
>> aggressive reclaim, this will impact performance. This is of course
>> something that I do not desire, so I'm wondering if scheduling a
>> commit with 'expungeDeletes' during off peak business hours is a
>> better approach than setting up a more aggressive merge policy.
>>
>
> As far as my experimentation with @expungeDeletes goes, if the data you
> indexed and committed using @expungeDeletes didn't touch segments with any
> deleted documents nor wasn't enough data to cause merging with a segment
> containing deleted documents, no deleted documents will be removed.
> Basically, @expungeDeletes expunges deletes in segments affected by the
> commit. If you have a large update that touches many segments containing
> deleted documents and you use @expungeDeletes, it could be just as resource
> intensive as an optimize.
>
> My setting for reclaimDeletesWeight:
>   <double name="reclaimDeletesWeight">5.0</double>
>
> It keeps the deleted documents down to ~ 10% without any noticable impact on
> resources or performance. But I'm still in the testing phase with this
> setting.
>


Re: Solr not working on new environment

2016-03-30 Thread Erick Erickson
Whoa! I thought you were going for SolrCloud. If you're not interested in
SolrCloud, you don't need to know anything about Zookeeper.

So it looks like Solr is running. You say:

bq:  When I try to connect to :8983/solr, I get a timeout.
Does it sound like firewall issues?

are you talking about Drupal or about a simple browser connection? If
the former, I'm all out of ideas
as I know very little about the Drupal integration and/or whether it's
even possible with a 5.x...

Best,
Erick

On Wed, Mar 30, 2016 at 2:52 AM, Jarus Bosman  wrote:
> OK, an update. I managed to remove the example/cloud directories, and stop
> Solr. I changed my startup script to be much simpler (./solr start) and now
> I get this:
>
> [root@ bin]# ./startsolr.sh
> Waiting up to 30 seconds to see Solr running on port 8983 [|]
> Started Solr server on port 8983 (pid=31937). Happy searching!
>
> [root@nationalarchives bin]# ./solr status
>
> Found 1 Solr nodes:
>
> Solr process 31937 running on port 8983
> {
>   "solr_home":"/opt/solr-5.5.0/server/solr",
>   "version":"5.5.0 2a228b3920a07f930f7afb6a42d0d20e184a943c - mike - 2016-02-16 15:22:52",
>   "startTime":"2016-03-30T09:24:21.445Z",
>   "uptime":"0 days, 0 hours, 3 minutes, 9 seconds",
>   "memory":"62 MB (%12.6) of 490.7 MB"}
>
> I now want to connect to it from my Drupal installation, but I'm getting
> this: "The Solr server could not be reached. Further data is therefore
> unavailable." - I realise this is probably not a Solr error, just giving
> all the information I have. When I try to connect to
> :8983/solr, I get a timeout. Does it sound like firewall issues?
>
> Regards,
> Jarus
>
> "Getting information off the Internet is like taking a drink from a fire
> hydrant." - Mitchell Kapor
>
>  .---.  .-.   .-..-.   .-.,'|"\.---.,--,
> / .-. )  ) \_/ /  \ \_/ )/| |\ \  / .-. ) .' .'
> | | |(_)(_)   /\   (_)| | \ \ | | |(_)|  |  __
> | | | |   / _ \ ) (   | |  \ \| | | | \  \ ( _)
> \ `-' /  / / ) \| |   /(|`-' /\ `-' /  \  `-) )
>  )---'  `-' (_)-'  /(_|  (__)`--'  )---'   )\/
> (_)   (__)(_) (__)
>
> On Wed, Mar 30, 2016 at 8:50 AM, Jarus Bosman  wrote:
>
>> Hi Erick,
>>
>> Thanks for the reply. It seems I have not done all my homework yet.
>>
>> We used to use Solr 3.6.2 on the old environment (we're using it in
>> conjunction with Drupal). When I got connectivity problems on the new
>> server, I decided to rather implement the latest version of Solr (5.5.0). I
>> read the Quick Start documentation and expected it to work first time, but
>> not so (as per my previous email). I will read up a bit on ZooKeeper (never
>> heard of it before - What is it?). Is there a good place to read up on
>> getting started with ZooKeeper and the latest versions of Solr (apart from
>> what you have replied, of course)?
>>
>> Thank you so much for your assistance,
>> Jarus
>>
>>
>> "Getting information off the Internet is like taking a drink from a fire
>> hydrant." - Mitchell Kapor
>>
>>  .---.  .-.   .-..-.   .-.,'|"\.---.,--,
>> / .-. )  ) \_/ /  \ \_/ )/| |\ \  / .-. ) .' .'
>> | | |(_)(_)   /\   (_)| | \ \ | | |(_)|  |  __
>> | | | |   / _ \ ) (   | |  \ \| | | | \  \ ( _)
>> \ `-' /  / / ) \| |   /(|`-' /\ `-' /  \  `-) )
>>  )---'  `-' (_)-'  /(_|  (__)`--'  )---'   )\/
>> (_)   (__)(_) (__)
>>
>> On Wed, Mar 30, 2016 at 6:20 AM, Erick Erickson 
>> wrote:
>>
>>> Good to meet you!
>>>
>>> It looks like you've tried to start Solr a time or two. When you start
>>> up the "cloud" example
>>> it creates
>>> /opt/solr-5.5.0/example/cloud
>>> and puts your SolrCloud stuff under there. It also automatically puts
>>> your configuration
>>> sets up on Zookeeper. When I get this kind of thing, I usually
>>>
>>> > stop Zookeeper (if running externally)
>>>
>>> > rm -rf /opt/solr-5.5.0/example/cloud
>>>
>>> > delete all the Zookeeper data. It may take a bit of poking to find out
>>> where
>>> the Zookeeper data is. It's usually in /tmp/zookeeper if you're running ZK
>>> standalone, or in a subdirectory in Solr if you're using embedded ZK.
>>> NOTE: if you're running standalone zookeeper, you should _definitely_
>>> change the data dir because it may disappear from /tmp/zookeeper. One
>>> of Zookeeper's little quirks.
>>>
>>> > try it all over again.
>>>
>>> Here's the problem. The example (-e cloud) tries to do a bunch of stuff
>>> for
>>> you to get the installation up and running without having to wend your way
>>> through all of the individual commands. Sometimes getting partway through
>>> leaves you in an ambiguous state. Or at least a state you don't quite know
>>> what all the moving parts are.
>>>
>>> Here's the steps you need to follow if you're doing them yourself rather
>>> than
>>> relying on the canned example
>>> 1> start Zookeeper externally. For experimentation, a single ZK is quite
>>> sufficient, I don't

Re: High Cpu sys usage

2016-03-30 Thread Erick Erickson
Both of these are anti-patterns. The soft commit interval of 1 second
is usually far too aggressive. And committing after every add is
also something to avoid.

Your original problem statement is high CPU usage. To see if your
committing is the culprit, I'd stop committing after each add entirely and
make the soft commit interval, say, 60 seconds. And keep the
hard commit interval whatever it is now, but make sure openSearcher is
set to false.
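
A sketch of what that looks like in solrconfig.xml (the 60-second values are
only the illustrative numbers from this thread, not a recommendation):

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>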

That should pinpoint whether the CPU usage is just because of your
committing. From there you can figure out the right balance...

If that's _not_ the source of your CPU usage, then at least you'll have
eliminated it as a potential problem.

Best,
Erick

On Wed, Mar 30, 2016 at 12:37 AM, YouPeng Yang
 wrote:
> Hi
>   Thank you, Erick.
>    The main collection that stores our trade data is set to soft commit
> when we import data using DIH. As you guessed, the softcommit interval
> is "<maxTime>1000</maxTime>" and we have autowarm counts set to 0. However,
> there are some collections that store our metadata in which we commit after
> each add, and these metadata collections hold just a few docs.
>
>
> Best Regards
>
>
> 2016-03-30 12:25 GMT+08:00 Erick Erickson :
>
>> Do not, repeat NOT try to "cure" the "Overlapping onDeckSearchers"
>> by bumping this limit! What that means is that your commits
>> (either hard commit with openSearcher=true or softCommit) are
>> happening far too frequently and your Solr instance is trying to do
>> all sorts of work that is immediately thrown away and chewing up
>> lots of CPU. Perhaps this will help:
>>
>>
>> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> I'd guess that you're
>>
>> > committing every second, or perhaps your indexing client is committing
>> after each add. If the latter, do not do this and rely on the
>> autocommit settings,
>> and if the former make those intervals as long as you can stand.
>>
>> > you may have your autowarm counts in your solrconfig.xml file set at
>> very high numbers (let's see the filterCache settings, the queryResultCache
>> settings etc.).
>>
>> I'd _strongly_ recommend that you put the on deck searchers back to
>> 2 and figure out why you have so many overlapping searchers.
>>
>> Best,
>> Erick
>>
>> On Tue, Mar 29, 2016 at 8:57 PM, YouPeng Yang 
>> wrote:
>> > Hi Toke
>> >   The number of collection is just 10.One of collection has 43
>> shards,each
>> > shard has two replicas.We continue  importing data from oracle all the
>> time
>> > while our systems provide searching service.
>> >There are "Overlapping onDeckSearchers" in my solr.logs. What is the
>> > meaning of "Overlapping onDeckSearchers"? We set
>> > <maxWarmingSearchers>20</maxWarmingSearchers> and <useColdSearcher>true</useColdSearcher>. Is it right?
>> >
>> >
>> >
>> > Best Regard.
>> >
>> >
>> > 2016-03-29 22:31 GMT+08:00 Toke Eskildsen :
>> >
>> >> On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
>> >> >   Our system still goes down as times going.We found lots of threads
>> are
>> >> > WAITING.Here is the threaddump that I copy from the web page.And 4
>> >> pictures
>> >> > for it.
>> >> >   Is there any relationship with my problem?
>> >>
>> >> That is a lot of commitScheduler-threads. Do you have hundreds of
>> >> collections in your cloud?
>> >>
>> >>
>> >> Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see
>> >> if you got caught in a downwards spiral of concurrent commits.
>> >>
>> >> - Toke Eskildsen, State and University Library, Denmark
>> >>
>> >>
>> >>
>>


Re: Solr response error 403 when I try to index medium.com articles

2016-03-30 Thread Jeferson dos Anjos
Jack, thanks for the reply. I'm not having trouble with other sites over
https. What logic are you suggesting I change? I didn't quite understand.

2016-03-29 21:01 GMT-03:00 Jack Krupansky :

> Medium switches from http to https, so you would need the logic for dealing
> with https security handshakes.
>
> -- Jack Krupansky
>
> On Tue, Mar 29, 2016 at 7:54 PM, Jeferson dos Anjos <
> jefersonan...@packdocs.com> wrote:
>
> > I'm trying to index some pages of the medium. But I get error 403. I
> > believe it is because the medium does not accept the user-agent solr. Has
> > anyone ever experienced this? You know how to change?
> >
> > I appreciate any help
> >
> > 
> > 500
> > 94
> > 
> > 
> > 
> > Server returned HTTP response code: 403 for URL:
> >
> >
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > 
> > 
> > java.io.IOException: Server returned HTTP response code: 403 for URL:
> >
> >
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown
> > Source) at
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> > Source) at java.lang.reflect.Constructor.newInstance(Unknown Source)
> > at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > at java.security.AccessController.doPrivileged(Native Method) at
> > sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown
> > Source) at
> > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown
> > Source) at
> > sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown
> > Source) at
> > sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown
> > Source) at
> >
> org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> > at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> > at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> > at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006) at
> >
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> > at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> > at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> > at org.eclipse.jetty.server.Server.handle(Server.java:368) at
> >
> >
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> > at
> >
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> > at
> >
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> > at
> >
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at
> > org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> > at
> >
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> > at
> >
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

Re: Solr response error 403 when I try to index medium.com articles

2016-03-30 Thread Chris Hostetter

403 means "forbidden" 

Something about the request Solr is sending -- or something about the IP 
address Solr is connecting from when talking to medium.com -- is causing 
the medium.com web server to reject the request.

This is something that servers may choose to do if they detect (via 
headers, or missing headers, or reverse ip lookup, or other 
distinctive nuances of how the connection was made) that the 
client connecting to their server isn't a "human browser" (ie: firefox, 
chrome, safari) and is a robot that they don't want to cooperate with (ie: 
they might be happy to serve their pages to the google-bot crawler, but not 
to some third-party they've never heard of).

The specifics of how/why you might get a 403 for any given url are hard to 
debug -- it might literally depend on how many requests you've sent to that 
domain in the past X hours.

In general Solr's ContentStream indexing from remote hosts isn't intended 
to be a super robust solution for crawling arbitrary websites on the web 
-- if that's your goal, then I would suggest you look into running a more 
robust crawler (nutch, droids, Lucidworks Fusion, etc...) that has more 
features and debugging options (notably: rate limiting) and use that code 
to fetch the content, then push it to Solr.


: Date: Tue, 29 Mar 2016 20:54:52 -0300
: From: Jeferson dos Anjos 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Solr response error 403 when I try to index medium.com articles
: 
: I'm trying to index some pages of the medium. But I get error 403. I
: believe it is because the medium does not accept the user-agent solr. Has
: anyone ever experienced this? You know how to change?
: 
: I appreciate any help
: 
: 
: 500
: 94
: 
: 
: 
: Server returned HTTP response code: 403 for URL:
: 
https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: 
: 
: java.io.IOException: Server returned HTTP response code: 403 for URL:
: 
https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
: at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown
: Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
: Source) at java.lang.reflect.Constructor.newInstance(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
: at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
: at java.security.AccessController.doPrivileged(Native Method) at
: sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown
: Source) at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown
: Source) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown
: Source) at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown
: Source) at 
org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
: at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
: at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
: at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
: at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
: at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006) at
: 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
: at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
: at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
: at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
: at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
: at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
: at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
: at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
: at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
: at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
: at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
: at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
: at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
: at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
: at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
: at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
: at org.eclipse.jetty.server.Server.handle(Server.java:368) at
: 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
: at 
org.eclipse.jetty.server.Bl

Re: Load Resource from within Solr Plugin

2016-03-30 Thread Chris Hostetter
: 
: 

1) as a general rule, if you have a <lib/> declaration which includes 
"WEB-INF" you are probably doing something wrong.

Maybe not in this case -- maybe "search-webapp/target" is a completely 
distinct java application and you are just re-using its jars.  But 9 
times out of 10, when people have a WEB-INF path they are trying to load 
jars from, it's because they *first* added their jars to Solr's WEB-INF 
directory, and then when that didn't work they added the path to the 
WEB-INF dir as a <lib/> ... but now you've got those classes being loaded 
twice, and you've multiplied all of your problems.

2) let's ignore the fact that your path has WEB-INF in it, and just 
assume it's some path to somewhere on disk that has nothing to 
do with solr, and you want to load those jars.

great -- solr will do that for you, and all of those classes will be 
available to plugins.

Now if you want to explicitly do something classloader related, you do 
*not* want to be using Thread.currentThread().getContextClassLoader() ... 
because the threads that execute everything in Solr are a pool of worker 
threads that is created before solr ever has a chance to parse your <lib/> directive.

You want to ensure anything you do related to a Classloader uses the 
ClassLoader Solr sets up for plugins -- that's available from the 
SolrResourceLoader.

You can always get the SolrResourceLoader via 
SolrCore.getSolrResourceLoader().  from there you can getClassLoader() if 
you really need some hairy custom stuff -- or if you are just trying to 
load a simple resource file as an InputStream, use openResource(String 
name) ... that will start by checking for it in the conf dir, and will 
fallback to your jar -- so you can have a default resource file shipped 
with your plugin, but allow users to override it in their collection 
configs.
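
A minimal sketch of that pattern (class and file names here are made up; only
the SolrResourceLoader/openResource usage is the point):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.solr.core.SolrCore;
import org.apache.solr.util.plugin.SolrCoreAware;

public class MyResourceLoadingPlugin implements SolrCoreAware {
  private String resourceText;

  @Override
  public void inform(SolrCore core) {
    // openResource() looks in the collection's conf dir first, then falls
    // back to the plugin jar, so users can override the bundled default.
    try (InputStream in = core.getResourceLoader().openResource("my-words.txt")) {
      resourceText = IOUtils.toString(in, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new RuntimeException("could not load my-words.txt", e);
    }
  }
}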


-Hoss
http://www.lucidworks.com/


Re: Regarding JSON indexing in SOLR 4.10

2016-03-30 Thread Aditya Desai
Hi Paul

Thanks a lot for your help! I have one small question, I have schema that
includes {Keyword,id,currency,geographic_name}. Now I have given
<uniqueKey>id</uniqueKey>
And
<copyField source="id" dest="Keyword"/>
Whenever I am running your script I am getting an error as


<response><lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst><lst name="error"><str name="msg">Document is
missing mandatory uniqueKey field: id</str><int name="code">400</int></lst></response>


Can you please share your expertise advice here. Can you please guide me a
good source to learn SOLR?

I am learning and I would really appreciate if you can help me.

Regards


On Wed, Mar 30, 2016 at 6:55 AM, Paul Hoffman  wrote:

> On Tue, Mar 29, 2016 at 11:30:06PM -0700, Aditya Desai wrote:
> > I am running SOLR 4.10 on port 8984 by changing the default port in
> > etc/jetty.xml. I am now trying to index all my JSON files to Solr running
> > on 8984. The following is the command
> >
> > curl 'http://localhost:8984/solr/update?commit=true' --data-binary *.json
> > -H 'Content-type:application/json'
>
> The wildcard is the problem; your shell is expanding --data-binary
> *.json to --data-binary foo.json bar.json baz.json and curl doesn't know
> how to download bar.json and baz.json.
>
> Try this instead:
>
> for file in *.json; do
>     curl 'http://localhost:8984/solr/update?commit=true' --data-binary "@$file" \
>         -H 'Content-type:application/json'
> done
>
> Paul.
>
> --
> Paul Hoffman 
> Systems Librarian
> Fenway Libraries Online
> c/o Wentworth Institute of Technology
> 550 Huntington Ave.
> Boston, MA 02115
> (617) 442-2384 (FLO main number)
>



-- 
Aditya Ramachandra Desai
MS Computer Science Graduate Student
USC Viterbi School of Engineering
Los Angeles, CA 90007
M : +1-415-463-9864 | L : https://www.linkedin.com/in/adityardesai


[possible bug]: [child] - ChildDocTransformerFactory returns top level documents nested under middle level documents when queried for the middle level ones

2016-03-30 Thread Alisa Z .
 I think I am observing an unexpected behavior of ChildDocTransformerFactory. 

The query is like this: 

/select?q={!parent which="type_s:doc.enriched.text"}type_s:doc.enriched.text.entities +text_t:pjm +type_t:Company +relevance_tf:[0.7%20TO%20*]&fl=*,[child parentFilter=type_s:doc.enriched.text limit=1000]

The levels of hierarchy are shown in the  type_s field.  So I am querying on 
some descendants and returning some ancestors that are somewhere in the middle 
of the hierarchy. I also want to get all the nested documents  below  that 
middle level. 

Here is the result:




 doc.enriched.text    // this is the level I 
wanted to get to and then go down from it 
 ... 
 13565 

 doc.enriched   // This is a document from 1 
level up, the parent of the   
   // current  type_s : 
doc.enriched.text document -- why is it here?   
 22024 


 doc.original   // This is an "uncle"
 26698 


 doc    // and this a grandparent!!! 
 
   
   


And so on, bringing the whole tree up and down all under my middle-level 
document.  
I really hope this is not the expected behavior.

I appreciate your help in advance. 

-- 
Alisa Zhila

Re: [possible bug]: [child] - ChildDocTransformerFactory returns top level documents nested under middle level documents when queried for the middle level ones

2016-03-30 Thread Anshum Gupta
I'm not the best person to comment on this so perhaps someone could chime
in as well, but can you try using a wildcard for your childFilter?
Something like: childFilter=type_s:doc.enriched.text.*

You could also possibly enrich the document with depth information and use
that for filtering out.

On Wed, Mar 30, 2016 at 11:34 AM, Alisa Z.  wrote:

>  I think I am observing an unexpected behavior of
> ChildDocTransformerFactory.
>
> The query is like this:
>
> /select?q={!parent which="type_s:doc.enriched.text"}type_s:doc.enriched.text.entities +text_t:pjm +type_t:Company +relevance_tf:[0.7%20TO%20*]&fl=*,[child parentFilter=type_s:doc.enriched.text limit=1000]
>
> The levels of hierarchy are shown in the  type_s field.  So I am querying
> on some descendants and returning some ancestors that are somewhere in the
> middle of the hierarchy. I also want to get all the nested documents
> below  that middle level.
>
> Here is the result:
>
> 
> 
>
>  doc.enriched.text// this is the level
> I wanted to get to and then go down from it
>  ... 
>  13565 
> 
>  doc.enriched   // This is a document
> from 1 level up, the parent of the
>// current  type_s :
> doc.enriched.text document -- why is it here?
>  22024 
> 
> 
>  doc.original   // This is an "uncle"
>  26698 
> 
> 
>  doc// and this a
> grandparent!!!
>
>
> 
>
> And so on, bringing the whole tree up and down all under my middle-level
> document.
> I really hope this is not the expected behavior.
>
> I appreciate your help in advance.
>
> --
> Alisa Zhila




-- 
Anshum Gupta


Re: Solr response error 403 when I try to index medium.com articles

2016-03-30 Thread Jack Krupansky
You could use the curl command to read a URL on Medium.com. That would let
you examine and control the headers to experiment.
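
For example, something along these lines (the User-Agent string is just an
illustration) shows the status line and headers the server sends back:

curl -sI -A "Mozilla/5.0 (X11; Linux x86_64)" \
  "https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1"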

Google is able to index Medium.

Check the URL and make sure it's not on one of the paths disallowed by
medium.com/robots.txt (the one you gave seems fine):

User-Agent: *
Disallow: /_/
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/*/edit
Sitemap: https://medium.com/sitemap/sitemap.xml



-- Jack Krupansky

On Wed, Mar 30, 2016 at 1:05 PM, Chris Hostetter 
wrote:

>
> 403 means "forbidden"
>
> Something about the request Solr is sending -- or soemthing about the IP
> address Solr is connecting from when talking to medium.com -- is causing
> hte medium.com web server to reject the request.
>
> This is something that servers may choose to do if they detect (via
> headers, or missing headers, or reverse ip lookup, or other
> distinctive nuances of how the connection was made) that the
> client connecting to their server isn't a "human browser" (ie: firefox,
> chrome, safari) and is a Robot that they don't want to cooperate with (ie:
> they might be happy toserve their pages to the google-bot crawler, but not
> to some third-party they've never heard of.
>
> The specifics of how/why you might get a 403 for any given url are hard to
> debug -- it might literally depend on how many requests you've sent tothat
> domain in the past X hours.
>
> In general Solr's ContentStream indexing from remote hosts isn't inteded
> to be a super robust solution for crawling arbitrary websites on the web
> -- if that's your goal, then i would suggest you look into running a more
> robust crawler (nutch, droids, Lucidworks Fusion, etc...) that has more
> features and debugging options (notably: rate limiting) and use that code
> to feath the content, then push it to Solr.
>
>
> : Date: Tue, 29 Mar 2016 20:54:52 -0300
> : From: Jeferson dos Anjos 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: Solr response error 403 when I try to index medium.com articles
> :
> : I'm trying to index some pages of the medium. But I get error 403. I
> : believe it is because the medium does not accept the user-agent solr. Has
> : anyone ever experienced this? You know how to change?
> :
> : I appreciate any help
> :
> : 
> : 500
> : 94
> : 
> : 
> : 
> : Server returned HTTP response code: 403 for URL:
> :
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> : 
> : 
> : java.io.IOException: Server returned HTTP response code: 403 for URL:
> :
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> : at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown
> : Source) at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> : Source) at java.lang.reflect.Constructor.newInstance(Unknown Source)
> : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> : at java.security.AccessController.doPrivileged(Native Method) at
> : sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown
> : Source) at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown
> : Source) at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown
> : Source) at
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown
> : Source) at
> org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> : at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> : at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> : at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> : at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> : at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006) at
> :
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> : at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> : at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> : at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> : at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> : at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> : at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> : at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> : at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> : at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> : at
> org.eclipse.jetty.server.session.SessionHandler.doSc

Re: Load Resource from within Solr Plugin

2016-03-30 Thread Rajesh Hazari
Max,
Have you looked into the external file field, which is reloaded on every hard
commit? The only disadvantage is that the file (personal-words.txt) has to be
placed in the data folder of each Solr core, for which we have a bash script
to do this job.

https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

Ignore this if it does not meet your requirement.

*Rajesh**.*

On Wed, Mar 30, 2016 at 1:21 PM, Chris Hostetter 
wrote:

> :
> :  : regex=".*\.jar" />
>
> 1) as a general rule, if you have a <lib/> declaration which includes
> "WEB-INF" you are probably doing something wrong.
>
> Maybe not in this case -- maybe "search-webapp/target" is a completely
> distinct java application and you are just re-using its jars.  But 9
> times out of 10, when people have a WEB-INF path they are trying to load
> jars from, it's because they *first* added their jars to Solr's WEB-INF
> directory, and then when that didn't work they added the path to the
> WEB-INF dir as a <lib/> ... but now you've got those classes being loaded
> twice, and you've multiplied all of your problems.
>
> 2) let's ignore the fact that your path has WEB-INF in it, and just
> assume it's some path to somewhere where on disk that has nothing to
> do with solr, and you want to load those jars.
>
> great -- solr will do that for you, and all of those classes will be
> available to plugins.
>
> Now if you want to explicitly do something classloader related, you do
> *not* want to be using Thread.currentThread().getContextClassLoader() ...
> because the threads that execute everything in Solr are a pool of worker
> threads that is created before solr ever has a chance to parse your
> <lib/> directive.
>
> You want to ensure anything you do related to a Classloader uses the
> ClassLoader Solr sets up for plugins -- that's available from the
> SolrResourceLoader.
>
> You can always get the SolrResourceLoader via
> SolrCore.getSolrResourceLoader().  from there you can getClassLoader() if
> you really need some hairy custom stuff -- or if you are just trying to
> load a simple resource file as an InputStream, use openResource(String
> name) ... that will start by checking for it in the conf dir, and will
> fallback to your jar -- so you can have a default resource file shipped
> with your plugin, but allow users to override it in their collection
> configs.
>
>
> -Hoss
> http://www.lucidworks.com/
>
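To make that last point concrete, this is roughly what overriding the bundled
resource in a non-cloud core could look like (the core name, file name and
paths here are assumptions, not taken from the original setup):

# Drop an override next to the core's other config files; openResource()
# checks conf/ before falling back to the copy bundled in the plugin jar.
cp my-personal-words.txt /opt/solr-5.5.0/server/solr/collection1/conf/personal-words.txt

# Reload the core so plugin code that reads the file at init time picks it up.
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"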


Re: Regarding JSON indexing in SOLR 4.10

2016-03-30 Thread Erick Erickson
The document you're sending to Solr doesn't have an "id" field. The
copyField directive has
nothing to do with it. And your copyField would be copying _from_ the
id field _to_ the
Keyword field, is that what you intended?

Even if the source and dest fields were reversed, it still wouldn't
work since there is no id
field as indicated by the error.
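For reference, a document that satisfies the uniqueKey requirement can be
posted to 4.10 with something like this (the port matches the thread; the
field values, and the assumption that the default core answers at
/solr/update, are illustrative only):

curl 'http://localhost:8984/solr/update?commit=true' \
  -H 'Content-type:application/json' \
  --data-binary '[{"id":"doc-1","Keyword":"example keyword","currency":"USD"}]'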

Let's see one of the json files please? Are they carefully-formulated
or arbitrary files? If
carefully formulated, just switch

Best,
Erick

On Wed, Mar 30, 2016 at 11:26 AM, Aditya Desai  wrote:
> Hi Paul
>
> Thanks a lot for your help! I have one small question, I have schema that
> includes {Keyword,id,currency,geographic_name}. Now I have given
> <uniqueKey>id</uniqueKey>
> And
> <copyField source="id" dest="Keyword"/>
> Whenever I am running your script I am getting an error as
>
> <response>
> <lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst>
> <lst name="error"><str name="msg">Document is missing mandatory uniqueKey field: id</str><int name="code">400</int></lst>
> </response>
>
> Can you please share your expertise advice here. Can you please guide me a
> good source to learn SOLR?
>
> I am learning and I would really appreciate if you can help me.
>
> Regards
>
>
> On Wed, Mar 30, 2016 at 6:55 AM, Paul Hoffman  wrote:
>
>> On Tue, Mar 29, 2016 at 11:30:06PM -0700, Aditya Desai wrote:
>> > I am running SOLR 4.10 on port 8984 by changing the default port in
>> > etc/jetty.xml. I am now trying to index all my JSON files to Solr running
>> > on 8984. The following is the command
>> >
>> > curl 'http://localhost:8984/solr/update?commit=true' --data-binary *.json
>> > -H 'Content-type:application/json'
>>
>> The wildcard is the problem; your shell is expanding --data-binary
>> *.json to --data-binary foo.json bar.json baz.json and curl doesn't know
>> how to download bar.json and baz.json.
>>
>> Try this instead:
>>
>> for file in *.json; do
>> curl 'http://localhost:8984/solr/update?commit=true' --data-binary "$file" -H 'Content-type:application/json'
>> done
>>
>> Paul.
>>
>> --
>> Paul Hoffman 
>> Systems Librarian
>> Fenway Libraries Online
>> c/o Wentworth Institute of Technology
>> 550 Huntington Ave.
>> Boston, MA 02115
>> (617) 442-2384 (FLO main number)
>>
>
>
>
> --
> Aditya Ramachandra Desai
> MS Computer Science Graduate Student
> USC Viterbi School of Engineering
> Los Angeles, CA 90007
> M : +1-415-463-9864 | L : https://www.linkedin.com/in/adityardesai


Re: Regarding JSON indexing in SOLR 4.10

2016-03-30 Thread Aditya Desai
Hi Erick

Thanks for your email. Here is the attached sample JSON file. When I
indexed the same JSON file with SOLR 5.5 using bin/post it indexed
successfully. Also all of my documents were indexed successfully with 5.5
and not with 4.10.

Regards

On Wed, Mar 30, 2016 at 3:13 PM, Erick Erickson 
wrote:

> The document you're sending to Solr doesn't have an "id" field. The
> copyField directive has
> nothing to do with it. And your copyField would be copying _from_ the
> id field _to_ the
> Keyword field, is that what you intended?
>
> Even if the source and dest fields were reversed, it still wouldn't
> work since there is no id
> field as indicated by the error.
>
> Let's see one of the json files please? Are they carefully-formulated
> or arbitrary files? If
> carefully formulated, just switch
>
> Best,
> Erick
>
> On Wed, Mar 30, 2016 at 11:26 AM, Aditya Desai  wrote:
> > Hi Paul
> >
> > Thanks a lot for your help! I have one small question, I have schema that
> > includes {Keyword,id,currency,geographic_name}. Now I have given
> > <uniqueKey>id</uniqueKey>
> > And
> > <copyField source="id" dest="Keyword"/>
> > Whenever I am running your script I am getting an error as
> >
> > <response>
> > <lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst>
> > <lst name="error"><str name="msg">Document is missing mandatory uniqueKey field: id</str><int name="code">400</int></lst>
> > </response>
> >
> > Can you please share your expertise advice here. Can you please guide me
> a
> > good source to learn SOLR?
> >
> > I am learning and I would really appreciate if you can help me.
> >
> > Regards
> >
> >
> > On Wed, Mar 30, 2016 at 6:55 AM, Paul Hoffman  wrote:
> >
> >> On Tue, Mar 29, 2016 at 11:30:06PM -0700, Aditya Desai wrote:
> >> > I am running SOLR 4.10 on port 8984 by changing the default port in
> >> > etc/jetty.xml. I am now trying to index all my JSON files to Solr
> running
> >> > on 8984. The following is the command
> >> >
> >> > curl 'http://localhost:8984/solr/update?commit=true' --data-binary *.json
> >> > -H 'Content-type:application/json'
> >>
> >> The wildcard is the problem; your shell is expanding --data-binary
> >> *.json to --data-binary foo.json bar.json baz.json and curl doesn't know
> >> how to download bar.json and baz.json.
> >>
> >> Try this instead:
> >>
> >> for file in *.json; do
> >> curl 'http://localhost:8984/solr/update?commit=true' --data-binary "$file" -H 'Content-type:application/json'
> >> done
> >>
> >> Paul.
> >>
> >> --
> >> Paul Hoffman 
> >> Systems Librarian
> >> Fenway Libraries Online
> >> c/o Wentworth Institute of Technology
> >> 550 Huntington Ave.
> >> Boston, MA 02115
> >> (617) 442-2384 (FLO main number)
> >>
> >
> >
> >
> > --
> > Aditya Ramachandra Desai
> > MS Computer Science Graduate Student
> > USC Viterbi School of Engineering
> > Los Angeles, CA 90007
> > M : +1-415-463-9864 | L : https://www.linkedin.com/in/adityardesai
>



-- 
Aditya Ramachandra Desai
MS Computer Science Graduate Student
USC Viterbi School of Engineering
Los Angeles, CA 90007
M : +1-415-463-9864 | L : https://www.linkedin.com/in/adityardesai


0A0B69C000E730AE9A1F08E6D7442CC0FB94FC0512624704D06EB48E03C49E16_Output.json
Description: application/json


issue with 5.3.1 and index version

2016-03-30 Thread William Bell
When I index with 5.4.1 using a luceneMatchVersion of 5.3.1 in solrconfig.xml,
the segments_9 file has Lucene54 in it. Why? Is this a known bug?

#strings segments_9

segments

Lucene54

commitTimeMSec

1459374733276




-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Regarding JSON indexing in SOLR 4.10

2016-03-30 Thread Erick Erickson
Hmmm, not sure and unfortunately won't be able to look very closely.
Do the Solr logs say anything more informative?

Also, the admin UI>>select core>>documents lets you submit docs
interactively to Solr, that's also worth a try I should think.

Best,
Erick

On Wed, Mar 30, 2016 at 3:15 PM, Aditya Desai  wrote:
> Hi Erick
>
> Thanks for your email. Here is the attached sample JSON file. When I indexed
> the same JSON file with SOLR 5.5 using bin/post it indexed successfully.
> Also all of my documents were indexed successfully with 5.5 and not with
> 4.10.
>
> Regards
>
> On Wed, Mar 30, 2016 at 3:13 PM, Erick Erickson 
> wrote:
>>
>> The document you're sending to Solr doesn't have an "id" field. The
>> copyField directive has
>> nothing to do with it. And your copyField would be copying _from_ the
>> id field _to_ the
>> Keyword field, is that what you intended?
>>
>> Even if the source and dest fields were reversed, it still wouldn't
>> work since there is no id
>> field as indicated by the error.
>>
>> Let's see one of the json files please? Are they carefully-formulated
>> or arbitrary files? If
>> carefully formulated, just switch
>>
>> Best,
>> Erick
>>
>> On Wed, Mar 30, 2016 at 11:26 AM, Aditya Desai  wrote:
>> > Hi Paul
>> >
>> > Thanks a lot for your help! I have one small question, I have schema
>> > that
>> > includes {Keyword,id,currency,geographic_name}. Now I have given
>> > <uniqueKey>id</uniqueKey>
>> > And
>> > <copyField source="id" dest="Keyword"/>
>> > Whenever I am running your script I am getting an error as
>> >
>> > <response>
>> > <lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst>
>> > <lst name="error"><str name="msg">Document is missing mandatory uniqueKey field: id</str><int name="code">400</int></lst>
>> > </response>
>> >
>> > Can you please share your expertise advice here. Can you please guide me
>> > a
>> > good source to learn SOLR?
>> >
>> > I am learning and I would really appreciate if you can help me.
>> >
>> > Regards
>> >
>> >
>> > On Wed, Mar 30, 2016 at 6:55 AM, Paul Hoffman  wrote:
>> >
>> >> On Tue, Mar 29, 2016 at 11:30:06PM -0700, Aditya Desai wrote:
>> >> > I am running SOLR 4.10 on port 8984 by changing the default port in
>> >> > etc/jetty.xml. I am now trying to index all my JSON files to Solr
>> >> > running
>> >> > on 8984. The following is the command
>> >> >
>> >> > curl 'http://localhost:8984/solr/update?commit=true' --data-binary *.json
>> >> > -H 'Content-type:application/json'
>> >>
>> >> The wildcard is the problem; your shell is expanding --data-binary
>> >> *.json to --data-binary foo.json bar.json baz.json and curl doesn't
>> >> know
>> >> how to download bar.json and baz.json.
>> >>
>> >> Try this instead:
>> >>
>> >> for file in *.json; do
>> >> curl 'http://localhost:8984/solr/update?commit=true' --data-binary "$file" -H 'Content-type:application/json'
>> >> done
>> >>
>> >> Paul.
>> >>
>> >> --
>> >> Paul Hoffman 
>> >> Systems Librarian
>> >> Fenway Libraries Online
>> >> c/o Wentworth Institute of Technology
>> >> 550 Huntington Ave.
>> >> Boston, MA 02115
>> >> (617) 442-2384 (FLO main number)
>> >>
>> >
>> >
>> >
>> > --
>> > Aditya Ramachandra Desai
>> > MS Computer Science Graduate Student
>> > USC Viterbi School of Engineering
>> > Los Angeles, CA 90007
>> > M : +1-415-463-9864 | L : https://www.linkedin.com/in/adityardesai
>
>
>
>
> --
> Aditya Ramachandra Desai
> MS Computer Science Graduate Student
> USC Viterbi School of Engineering
> Los Angeles, CA 90007
> M : +1-415-463-9864 | L : https://www.linkedin.com/in/adityardesai
>


Re: Solr response error 403 when I try to index medium.com articles

2016-03-30 Thread Jeferson dos Anjos
Great, but is there any way to change the header Solr sends to set the user-agent?
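As far as I know there is no configuration knob for that header on Solr's
remote-streaming fetch, but per Jack's suggestion below you can at least
confirm with curl whether the User-Agent is what triggers the 403 (the URL is
taken from the earlier message; the browser string is just an example):

URL='https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1'

# Default curl agent vs. a browser-like agent; if only the second prints 200,
# the block is User-Agent based.
curl -s -o /dev/null -w '%{http_code}\n' "$URL"
curl -s -o /dev/null -w '%{http_code}\n' \
  -A 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0 Safari/537.36' \
  "$URL"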

2016-03-30 17:13 GMT-03:00 Jack Krupansky :

> You could use the curl command to read a URL on Medium.com. That would let
> you examine and control the headers to experiment.
>
> Google is able to index Medium.
>
> Check the URL and make sure it's not on one of the paths disallowed by
> medium.com/robots.txt (the one you gave seems fine):
>
> User-Agent: *
> Disallow: /_/
> Disallow: /m/
> Disallow: /me/
> Disallow: /@me$
> Disallow: /@me/
> Disallow: /*/*/edit
> Sitemap: https://medium.com/sitemap/sitemap.xml
>
>
>
> -- Jack Krupansky
>
> On Wed, Mar 30, 2016 at 1:05 PM, Chris Hostetter  >
> wrote:
>
> >
> > 403 means "forbidden"
> >
> > Something about the request Solr is sending -- or something about the IP
> > address Solr is connecting from when talking to medium.com -- is causing
> > the medium.com web server to reject the request.
> >
> > This is something that servers may choose to do if they detect (via
> > headers, or missing headers, or reverse ip lookup, or other
> > distinctive nuances of how the connection was made) that the
> > client connecting to their server isn't a "human browser" (ie: firefox,
> > chrome, safari) and is a Robot that they don't want to cooperate with (ie:
> > they might be happy to serve their pages to the google-bot crawler, but not
> > to some third-party they've never heard of).
> >
> > The specifics of how/why you might get a 403 for any given url are hard to
> > debug -- it might literally depend on how many requests you've sent to that
> > domain in the past X hours.
> >
> > In general Solr's ContentStream indexing from remote hosts isn't intended
> > to be a super robust solution for crawling arbitrary websites on the web
> > -- if that's your goal, then i would suggest you look into running a more
> > robust crawler (nutch, droids, Lucidworks Fusion, etc...) that has more
> > features and debugging options (notably: rate limiting) and use that code
> > to fetch the content, then push it to Solr.
> >
> >
> > : Date: Tue, 29 Mar 2016 20:54:52 -0300
> > : From: Jeferson dos Anjos 
> > : Reply-To: solr-user@lucene.apache.org
> > : To: solr-user@lucene.apache.org
> > : Subject: Solr response error 403 when I try to index medium.com
> articles
> > :
> > : I'm trying to index some pages of the medium. But I get error 403. I
> > : believe it is because the medium does not accept the user-agent solr.
> Has
> > : anyone ever experienced this? You know how to change?
> > :
> > : I appreciate any help
> > :
> > : 
> > : 500
> > : 94
> > : 
> > : 
> > : 
> > : Server returned HTTP response code: 403 for URL:
> > :
> >
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > : 
> > : 
> > : java.io.IOException: Server returned HTTP response code: 403 for URL:
> > :
> >
> https://medium.com/@producthunt/10-mac-menu-bar-apps-you-can-t-live-without-df087d2c6b1
> > : at sun.reflect.GeneratedConstructorAccessor314.newInstance(Unknown
> > : Source) at
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
> > : Source) at java.lang.reflect.Constructor.newInstance(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > : at sun.net.www.protocol.http.HttpURLConnection$10.run(Unknown Source)
> > : at java.security.AccessController.doPrivileged(Native Method) at
> > : sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown
> > : Source) at
> > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown
> > : Source) at
> > sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown
> > : Source) at
> > sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown
> > : Source) at
> >
> org.apache.solr.common.util.ContentStreamBase$URLStream.getStream(ContentStreamBase.java:87)
> > : at
> >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> > : at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > : at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:144)
> > : at
> >
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:291)
> > : at org.apache.solr.core.SolrCore.execute(SolrCore.java:2006) at
> > :
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> > : at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
> > : at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:204)
> > : at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> > : at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> > : at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> > : at
> >
> org.eclipse.jetty.secu

Re: No live SolrServers available to handle this request

2016-03-30 Thread Anil
Thanks Shawn and Elaine,

Elaine,
Yes, all the documents with the same route key reside on the same shard.
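For anyone following along, that co-location typically comes from the
compositeId router: prefixing the route key onto the document id sends every
document with that prefix to the same shard, which is what collapse/expand on
the key needs. A minimal sketch (collection, field names and values are made
up for illustration):

curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
  -H 'Content-type:application/json' \
  --data-binary '[
    {"id":"acme!order-1","customer_s":"acme"},
    {"id":"acme!order-2","customer_s":"acme"}
  ]'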

Shawn,
I will try to capture the logs.

Thanks.

Regards,
Anil



On 25 March 2016 at 02:57, Elaine Cario  wrote:

> Anil,
>
> I've seen situations where if there was a problem with a specific query,
> and every shard responds with the same error, the actual exception gets
> hidden  by a "No live SolrServers..." exception.  We originally saw this
> with wildcard queries (when every shard reported a "too many expansions..."
> type error), but the exception in the response was a "No live SolrServers..."
> error.
>
> You mention that you are using collapse/expand, and that you have shards -
> that could possibly cause some issue, as I think collapse and expand only
> work correctly if the data for any particular collapse value resides on one
> shard.
>
> On Sat, Mar 19, 2016 at 1:04 PM, Shawn Heisey  wrote:
>
> > On 3/18/2016 9:55 PM, Anil wrote:
> > > Thanks for your response.
> > > CDH is a Cloudera (third party) distribution. is there any to get the
> > > notifications copy of it when cluster state changed ? in logs ?
> > >
> > > I can assume that the exception is result of no availability of
> replicas
> > > only. Agree?
> >
> > Yes, I think that Solr believes there are no replicas for at least one
> > shard.  As for why it believes that, I cannot say.
> >
> > If Solr logged every single thing that happened where zookeeper (or even
> > just the clusterstate) is involved, you'd be drowning in logs.  Much
> > more than already happens.  The logfile is already very verbose.
> >
> > Chances are that at least one of your Solr nodes *did* log something
> > related to a problem with that collection before you got the error
> > you're asking about.
> >
> > The "No live SolrServers" error is one that people are seeing quite
> > frequently.  There may be some instances where Solr isn't behaving
> > correctly, but I think when this happens, it usually indicates there's a
> > real problem of some kind.
> >
> > To troubleshoot, we'll need to see any errors or warnings you find in
> > your Solr logfiles from the time before you get an error on a request.
> > You'll need to check the logfile on all Solr nodes.
> >
> > It might be a good idea to also involve Cloudera support, see what they
> > think.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: How to implement Autosuggestion

2016-03-30 Thread chandan khatri
Hi All,

I have a similar query regarding autosuggestion. My use case is as below:

1. User enters product name (say Nokia)
2. I want suggestions along with the category to which the product
belongs (e.g. Nokia belongs to the "electronics" and "mobile" categories), so I
want suggestions like "Nokia in electronics" and "Nokia in mobile".

I am able to get the suggestions using the OOTB AnalyzingInfixSuggester, but I
am not sure how I can get the category along with the suggestion (can this
category be considered as a facet of the suggestion?).
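One low-tech way to get the category next to each suggestion, while a
suggester-native answer is worked out, is a prefix query that facets on the
category field; a rough sketch (collection and field names are assumptions
about the schema):

# Returns no documents, just counts per category for products whose name
# starts with the typed prefix, e.g. electronics (12), mobile (7).
curl 'http://localhost:8983/solr/products/select?q=name:nokia*&rows=0&facet=true&facet.field=category&wt=json'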

Any help/pointer is highly appreciated.

Thanks,
Chandan

On Wed, Mar 30, 2016 at 1:37 PM, Alessandro Benedetti  wrote:

> Hi Mugeesh, the autocompletion world is not as simple as you would expect.
> Which kind of auto suggestion are you interested in ?
>
> First of all, simple string autosuggestion or document autosuggestion?
> (with additional fields to show beyond the label)
> Are you interested in the analysis for the text to suggest ? Fuzzy
> suggestions ? exact "beginning of the phrase" suggestions ? infix
> suggestions ?
> Try to give some example and we could help better .
> There is a specific suggester component, so it is likely to be useful to
> you, but let's try to discover more.
>
> Cheers
>
> On Mon, Mar 28, 2016 at 6:03 PM, Reth RM  wrote:
>
> > Solr AnalyzingInfix suggester component:
> > https://lucidworks.com/blog/2015/03/04/solr-suggester/
> >
> >
> >
> > On Mon, Mar 28, 2016 at 7:57 PM, Mugeesh Husain 
> wrote:
> >
> > > Hi,
> > >
> > > I am looking for the best way to implement autosuggestion in ecommerce
> > > using solr or elasticsearch.
> > >
> > > I guess using ngram analyzer is not a good way if data is big.
> > >
> > >
> > > Please suggest me any link or your opinion ?
> > >
> > >
> > >
> > > Thanks
> > > Mugeesh
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://lucene.472066.n3.nabble.com/How-to-implement-Autosuggestion-tp4266434.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: Solr not working on new environment

2016-03-30 Thread Jarus Bosman
OK, solved. It seems I had to first create a core, then configure Drupal to
point to the path for that core.
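For anyone who lands on this thread later, the core-creation step on 5.5 is a
one-liner from the install directory (the core name here is an assumption;
Drupal then gets pointed at that core's URL):

bin/solr create -c drupal
# Drupal's Solr connection would then use http://<host>:8983/solr/drupal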

I have to say, this is one of the more helpful lists I have used. Thanks a
lot for your help!



"Getting information off the Internet is like taking a drink from a fire
hydrant." - Mitchell Kapor

 .---.  .-.   .-..-.   .-.,'|"\.---.,--,
/ .-. )  ) \_/ /  \ \_/ )/| |\ \  / .-. ) .' .'
| | |(_)(_)   /\   (_)| | \ \ | | |(_)|  |  __
| | | |   / _ \ ) (   | |  \ \| | | | \  \ ( _)
\ `-' /  / / ) \| |   /(|`-' /\ `-' /  \  `-) )
 )---'  `-' (_)-'  /(_|  (__)`--'  )---'   )\/
(_)   (__)(_) (__)

On Wed, Mar 30, 2016 at 5:51 PM, Erick Erickson 
wrote:

> Whoa! I thought you were going for SolrCloud. If you're not interested in
> SolrCloud, you don't need to know anything about Zookeeper.
>
> So it looks like Solr is running. You say:
>
> bq:  When I try to connect to :8983/solr, I get a timeout.
> Does it sound like firewall issues?
>
> are you talking about Drupal or about a simple browser connection? If
> the former, I'm all out of ideas
> as I know very little about the Drupal integration and/or whether it's
> even possible with a 5.x...
>
> Best,
> Erick
>
> On Wed, Mar 30, 2016 at 2:52 AM, Jarus Bosman  wrote:
> > OK, an update. I managed to remove the example/cloud directories, and
> stop
> > Solr. I changed my startup script to be much simpler (./solr start) and
> now
> > I get this:
> >
> > *[root@ bin]# ./startsolr.sh*
> > *Waiting up to 30 seconds to see Solr running on port 8983 [|]*
> > *Started Solr server on port 8983 (pid=31937). Happy searching!*
> > *[root@nationalarchives bin]# ./solr status*
> >
> > *Found 1 Solr nodes:*
> >
> > *Solr process 31937 running on port 8983*
> > *{*
> > *  "solr_home":"/opt/solr-5.5.0/server/solr",*
> > *  "version":"5.5.0 2a228b3920a07f930f7afb6a42d0d20e184a943c - mike -
> > 2016-02-16 15:22:52",*
> > *  "startTime":"2016-03-30T09:24:21.445Z",*
> > *  "uptime":"0 days, 0 hours, 3 minutes, 9 seconds",*
> > *  "memory":"62 MB (%12.6) of 490.7 MB"}*
> >
> > I now want to connect to it from my Drupal installation, but I'm getting
> > this: "The Solr server could not be reached. Further data is therefore
> > unavailable." - I realise this is probably not a Solr error, just giving
> > all the information I have. When I try to connect to
> > :8983/solr, I get a timeout. Does it sound like firewall
> issues?
> >
> > Regards,
> > Jarus
> >
> > "Getting information off the Internet is like taking a drink from a fire
> > hydrant." - Mitchell Kapor
> >
> >  .---.  .-.   .-..-.   .-.,'|"\.---.,--,
> > / .-. )  ) \_/ /  \ \_/ )/| |\ \  / .-. ) .' .'
> > | | |(_)(_)   /\   (_)| | \ \ | | |(_)|  |  __
> > | | | |   / _ \ ) (   | |  \ \| | | | \  \ ( _)
> > \ `-' /  / / ) \| |   /(|`-' /\ `-' /  \  `-) )
> >  )---'  `-' (_)-'  /(_|  (__)`--'  )---'   )\/
> > (_)   (__)(_) (__)
> >
> > On Wed, Mar 30, 2016 at 8:50 AM, Jarus Bosman  wrote:
> >
> >> Hi Erick,
> >>
> >> Thanks for the reply. It seems I have not done all my homework yet.
> >>
> >> We used to use Solr 3.6.2 on the old environment (we're using it in
> >> conjunction with Drupal). When I got connectivity problems on the new
> >> server, I decided to rather implement the latest version of Solr
> (5.5.0). I
> >> read the Quick Start documentation and expected it to work first time,
> but
> >> not so (as per my previous email). I will read up a bit on ZooKeeper
> (never
> >> heard of it before - What is it?). Is there a good place to read up on
> >> getting started with ZooKeeper and the latest versions of Solr (apart
> from
> >> what you have replied, of course)?
> >>
> >> Thank you so much for your assistance,
> >> Jarus
> >>
> >>
> >> "Getting information off the Internet is like taking a drink from a fire
> >> hydrant." - Mitchell Kapor
> >>
> >>  .---.  .-.   .-..-.   .-.,'|"\.---.,--,
> >> / .-. )  ) \_/ /  \ \_/ )/| |\ \  / .-. ) .' .'
> >> | | |(_)(_)   /\   (_)| | \ \ | | |(_)|  |  __
> >> | | | |   / _ \ ) (   | |  \ \| | | | \  \ ( _)
> >> \ `-' /  / / ) \| |   /(|`-' /\ `-' /  \  `-) )
> >>  )---'  `-' (_)-'  /(_|  (__)`--'  )---'   )\/
> >> (_)   (__)(_) (__)
> >>
> >> On Wed, Mar 30, 2016 at 6:20 AM, Erick Erickson <
> erickerick...@gmail.com>
> >> wrote:
> >>
> >>> Good to meet you!
> >>>
> >>> It looks like you've tried to start Solr a time or two. When you start
> >>> up the "cloud" example
> >>> it creates
> >>> /opt/solr-5.5.0/example/cloud
> >>> and puts your SolrCloud stuff under there. It also automatically puts
> >>> your configuration
> >>> sets up on Zookeeper. When I get this kind of thing, I usually
> >>>
> >>> > stop Zookeeper (if running externally)
> >>>
> >>> > rm -rf /opt/solr-5.5.0/example/cloud
> >>>
> >>> > delete all the Zookeeper data. It may take a bit of poking to fin