Accent insensitive search for greek characters

2017-10-13 Thread Chitra
Hi,

   I want to perform accent-insensitive search on Greek text by removing
or replacing accent marks with their unaccented equivalents.

Eg: when searching for an accented Greek word such as *πῬοἲὅν*, we expect an
accent-insensitive match, i.e. it should also match the unaccented form *προιον*.



Moreover, I don't have much knowledge of Greek characters, so I am looking
for standard rules to perform Greek accent-insensitive search.


Does *ICUFoldingFilter* solve my case? I have already tried it, and it works
fine for accented Greek characters. But it is not language specific: it applies
folding across all languages, so I am not sure whether it will break the
existing behavior of the other languages in my index.


Is there any way to make ICUFoldingFilter language specific?
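
(For context, one way to keep the folding scoped to Greek rather than applying
it index-wide is a dedicated field type used only by the Greek fields. A rough
sketch, assuming the ICU analysis-extras jars are on the classpath and using
placeholder names:)

<fieldType name="text_el_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Greek-aware lowercasing (handles final sigma etc.) -->
    <filter class="solr.GreekLowerCaseFilterFactory"/>
    <!-- decompose, drop combining accent marks, recompose -->
    <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
  </analyzer>
</fieldType>

Because the type is only used by the Greek fields, the folding cannot affect
the analysis of other languages in the index.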



-- 
Regards,
Chitra


Concern on solr commit

2017-10-13 Thread Leo Prince
Hi,

I am new to the community; thank you for letting me in.

Let me get into my concern real quick. Please find my OS and Solr versions below:

Ubuntu 14.04.4 LTS
solr-spec 4.10.2
solr-impl  4.10.2 1634293 - mike - 2014-10-26 05:56:21
lucene-spec  4.10.2
lucene-impl   4.10.2 1634293 - mike - 2014-10-26 05:51:56

I am getting the following errors/warnings from Solr

1, ERROR: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Error opening new searcher. exceeded limit of maxWarmingSearchers=2,
try again later.
2, PERFORMANCE WARNING: Overlapping onDeckSearchers=2
3, WARN: DistributedUpdateProcessor error sending update
   org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Error opening new searcher. exceeded limit of maxWarmingSearchers=2,
try again later.

I have gone through various discussion threads and got a rough idea of the
scenarios in which maxWarmingSearchers is exceeded. We commit from the
application for individual requests and also have autocommit enabled.

So my concern is: is there any chance of performance issues when the number
of commits is high at a particular point in time? In our application we
estimate that 100-500 commits can happen simultaneously from the application,
plus autocommit for those individual requests that do not commit explicitly
after the write.

Autocommit is configured as follows:

<autoCommit>
   <maxTime>15000</maxTime>
   <openSearcher>false</openSearcher>
</autoCommit>


Looking at the openSearcher parameter, I see that it makes Solr not open a
new searcher while a commit is in progress. How many seconds or milliseconds
does Solr take to complete a commit and allow new searchers to be opened?
Considering that we have a massive number of individual commits plus a large
autocommit, is setting openSearcher to false a good practice? And what happens
to new requests while a commit is in progress: will they be queued until a new
searcher is opened, or will they be served by the existing searcher?

Thanks in advance,

Leo Prince.


Re: Concern on solr commit

2017-10-13 Thread Emir Arnautović
Hi Leo,
It is considered a bad practice to commit from your application. You should let 
Solr handle commits. There is a great article about soft and hard commits: 
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 


If you really want to commit from your application, then you should use the 
commitWithin parameter, which groups your commits into a single commit and 
ensures that your changes are committed within some time (you can achieve that 
with autocommit as well).
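
For illustration, commitWithin can be passed as a request parameter on the 
update call; the core name and document below are placeholders:

curl 'http://localhost:8983/solr/mycore/update?commitWithin=10000' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1", "title_s": "example"}]'

Every document sent this way is guaranteed to be committed within 10 seconds, 
and Solr is free to fold many such requests into a single commit.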

Opening a searcher can be fast, but it can also take a while - it depends on 
the warming parameters you set. In any case, I would not recommend focusing on 
making searcher opening faster; instead, look at your NRT requirements and 
commit as rarely as possible.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/






Solr related questions

2017-10-13 Thread startrekfan
Hello,

I have some Solr related questions:

1.) I created a core and tried to simplify the managed-schema file. But if
I remove all "unnecessary" fields/fieldtypes, I get errors like: field
"_version_" is missing, type "boolean" is missing, and so on. Why do I have
to define these types/fields? Which fields/fieldtypes are required?

2.) Can I modify the managed-schema remotely/programmatically, e.g. with a
POST request, or only by editing the managed-schema file directly?

3.) When I have a service (SolrNet client) that pushes a file from a file
server to Solr, will it cause twice the traffic (from the file server to my
service and from the service to Solr)? Is there a way to index the file
directly? (I need to add additional attributes to the index document.)

Thank you


Appending fields to pre-existed document

2017-10-13 Thread Игорь Абрашин
Hello, solr community.
We are struggling with updating already existing docs. For instance, we
indexed one jpg with the Tika parser and got a batch of attributes. Then we
want to index a database data source and append those fields to the document
with the same uniqueKey defined in schema.xml. All we get is the doc that came
first being overwritten by the new one. If we put overwrite=false in the
params, duplicate docs appear instead. Do you have any clues or suggestions on
this? How can we append one batch of attributes to another, or merge the
documents after the duplicate is created?


Re: Solr related questions

2017-10-13 Thread Amrit Sarkar
Hi,

1.) I created a core and tried to simplify the managed-schema file. But if
> I remove all "unecessary" fields/fieldtypes, I get errors like: field
> "_version_" is missing, type "boolean" is missing and so on. Why do I have
> to define this types/fields? Which fields/fieldtypes are required?


Solr expects those primitive field names and types to be present in the
schema, though a better explanation of this would help. "_version_" and a
unique id field are mandatory for each document: "_version_" holds the current
version of the document, which is used for syncing across nodes and for atomic
updates of documents.

 2.) Can I modify the managed-schema remotly/by program e.g. with a post

request or only by editing the managed-schema file directly?

Sure, the Schema API has been available for a while:
https://lucene.apache.org/solr/guide/6_6/schema-api.html
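
For example, a field can be added remotely with a POST to the schema endpoint
(the core and field names below are placeholders):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycore/schema' \
  -d '{"add-field": {"name": "my_new_field", "type": "string", "stored": true}}'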

3.) When I have a service(solrnet client) that pushes a file from a
> fileserver to solr, will it cause two times traffic? (from the fileserver
> to my service and from the service to solr?) Is there a chance to index the
> file direct? (I need to add additional attributes to the index document)


Two times traffic? Where? Solr will receive the docs only once, so we are good
on that part. Please use SolrJ to index documents if possible, as it is the
most up-to-date client; if you are on SolrCloud, use CloudSolrClient.
Regarding indexing files directly, you can use the DIH (DataImportHandler),
depending on the file format (CSV, XML, JSON), but keep in mind it is single
threaded.

Hope this clarifies some of it.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2



Re: Appending fields to pre-existed document

2017-10-13 Thread Rick Leir
Hi
Show us the solr version,  field types, the handler definition, and the query 
you send. Any log entries?
Cheers -- Rick


-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Concern on solr commit

2017-10-13 Thread Leo Prince
Hi Emir,

Thanks for the response.

We have specific near-realtime search requirements, which is why we are
explicitly invoking Solr commits. However, we will try to reduce the commits
from the application. In the meantime, the errors/warnings I mentioned in my
previous mail: are they really due to frequent commits?

We have plenty of requests that take several seconds to be served by Solr,
irrespective of network overhead, so any thoughts on whether commit frequency
affects Solr latency?

Thanks,
Leo Prince



Re: Solr related questions

2017-10-13 Thread Rick Leir
1/ the _version_ field is necessary.
2/ there is a Solr API for editing the managed schema
3/ not having used solrnet, I suspect you can bypass it and use the Solr REST 
api directly.
Cheers -- Rick


-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
1) "_version_" is not "unnecessary"; quite the contrary, it is fundamental for
Solr to work. The same goes for the types you use across your field
definitions.
There was a time you could see explanatory comments next to these fields in
the schema.xml (that doesn't seem to be the case anymore). Nabble stripped
those comments from this message; they are re-posted in the follow-up below.

2) https://lucene.apache.org/solr/guide/6_6/schema-api.html , yes you can

3) Unless your files are local to the process you use to push them to Solr,
you will have "two times traffic" independently of the client technology.

Cheers






-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Geometries distance

2017-10-13 Thread Maruska Melucci
Hi

I need to obtain the distance between a point and a MULTILINESTRING.
MULTILINESTRINGs are stored in Solr as geometries using the JTS library,
configured as follows:



I'm using geodist to return the distance between a coordinate and the
geometries, but the function seems to work incorrectly; it doesn't return the
right distance.
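
(The field-type definition above was stripped by the mail archive. For
illustration only, a JTS-backed geometry field in Solr 7 is typically declared
along these lines; the name and values here are placeholders, not the poster's
actual configuration:)

<fieldType name="geom_rpt" class="solr.RptWithGeometrySpatialField"
           spatialContextFactory="JTS"
           autoIndex="true"
           distErrPct="0.025"
           maxDistErr="0.001"
           distanceUnits="kilometers"/>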

Is it possible to obtain the distance between a point and a MULTILINESTRING
using Solr?
Is it possible to obtain the distance using the JTS library?

I'm using Solr 7

Thank you

Maruska


Re: Concern on solr commit

2017-10-13 Thread Emir Arnautović
Hi Leo,
The errors you are seeing are related to frequent commits - a new commit is 
issued before the searcher for the previous commit has been opened and warmed.

I haven't looked at the indexing code in a while, but assuming it did not 
change, commits and writes are mutually exclusive - guarded by the same lock. 
So yes - frequent commits will result in longer indexing request latency. Also, 
each commit results in a new segment and segments are merged - a small segment 
merges fast, but it is still overhead.

I would suggest that you replace explicit commits with at least 
commitWithin=100 (or as high as possible) - that will result in at least 100ms 
worth of documents being grouped into a single commit.
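
For reference, the usual pattern when letting Solr handle commits is an
autocommit plus a soft commit tuned to the NRT requirement; a sketch in
solrconfig.xml, with illustrative values:

<autoCommit>
  <maxTime>15000</maxTime>          <!-- hard commit: flushes to disk, no new searcher -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>           <!-- soft commit: opens a new searcher, makes docs visible -->
</autoSoftCommit>

With something like this in place the application does not need to send
commits at all.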

Also, if you are committing this frequently, think about turning off your 
caches, since they are invalidated on each commit.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
Nabble mutilated my reply :

*Comment*: If you remove this field, you must _also_ disable the update log in
  solrconfig.xml or Solr won't start. _version_ and update log are required for
  SolrCloud.

*Comment*: points to the root document of a block of nested documents. Required
  for nested document support, may be removed otherwise.

*Comment*: Only remove the "id" field if you have a very good reason to. While
  not strictly required, it is highly recommended. A <uniqueKey> is present in
  almost all Solr installations. See the <uniqueKey> declaration below where
  <uniqueKey> is set to "id".
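
For reference, the (stripped) definitions those comments accompany look roughly
like this in a stock Solr 7 default configset; this is a sketch of the
defaults, not the poster's schema (in 4.x/5.x schemas _version_ is typically
type="long", indexed and stored):

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false"/>
<uniqueKey>id</uniqueKey>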




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

You are getting NPE at:

String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL

// related code

String rawContentType = conn.getContentType();

public String getContentType() {
return getHeaderField("content-type");
}

HttpURLConnection conn = (HttpURLConnection) u.openConnection();

Can you check whether the headers at your web page level are set properly
and include the "content-type" key?
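
For example, something like the following should show whether the header is
present (using the URL from your command; wget -S works as well):

curl -sI http://quadra.franz.com:9091/index.md | grep -i '^content-type'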


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Wed, Oct 11, 2017 at 9:08 PM, Kevin Layer  wrote:

> I want to use solr to index a markdown website.  The files
> are in native markdown, but they are served in HTML (by markserv).
>
> Here's what I did:
>
> docker run --name solr -d -p 8983:8983 -t solr
> docker exec -it --user=solr solr bin/solr create_core -c handbook
>
> Then, to crawl the site:
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
> org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/
> handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to
> file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than
> every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.solr.util.SimplePostTool$PageFetcher.
> readPageFromUrl(SimplePostTool.java:1138)
> at org.apache.solr.util.SimplePostTool.webCrawl(
> SimplePostTool.java:603)
> at org.apache.solr.util.SimplePostTool.postWebPages(
> SimplePostTool.java:563)
> at org.apache.solr.util.SimplePostTool.doWebMode(
> SimplePostTool.java:365)
> at org.apache.solr.util.SimplePostTool.execute(
> SimplePostTool.java:187)
> at org.apache.solr.util.SimplePostTool.main(
> SimplePostTool.java:172)
> quadra[git:master]$
>
>
> Any ideas on what I did wrong?
>
> Thanks.
>
> Kevin
>


Re: Solr related questions

2017-10-13 Thread startrekfan
Thank you for your answer.

To 3.)
The file is on server A, my program is on server B, and Solr is on server C.
If I use a normal HTTP (REST) post, my program has to fetch the file content
from server A to server B and then post it from server B to server C, as there
is no open connection between A and C. So the file has to be transmitted two
times.
Is there a way to tell Solr to read the file _directly_ from server A (e.g.
via SMB)?

Thank you


>


Re: Appending fields to pre-existed document

2017-10-13 Thread Игорь Абрашин
Hi, Rick.
Here is what we've got:
Solr version 7.0.0
Field definitions in schema.xml for the dataimport data source (our database):

And a batch of identical fields

url_path

Field definitions in schema.xml for the updateExtract handler:

And other fields which are not important in our case.
So our goal is to combine attributes coming from the database with the
extracted content coming from the files.
The /dataimport handler is completely default.
/update/extract is the same situation: we get it from the out-of-the-box Solr
build.
data-config.xml is pretty simple and does not deviate from the examples.
We use curl to post the file and extract content from the stream. For data
import we tried both full-import and delta-import, with no success for either.





Re: Solr related questions

2017-10-13 Thread alessandro.benedetti
The only way Solr will fetch documents by itself is through the Data Import
Handler. Take a look at the URLDataSource[1] to see if it fits.
You may need to customize it.

[1]
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#urldatasource
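
As a rough sketch of what a URLDataSource-based data-config.xml can look like
(the URL, entity name and XPath expressions below are placeholders, assuming
an XML feed on the file server):

<dataConfig>
  <dataSource type="URLDataSource"/>
  <document>
    <entity name="remoteDocs"
            processor="XPathEntityProcessor"
            url="http://fileserver.example.com/docs/feed.xml"
            forEach="/docs/doc">
      <field column="id"    xpath="/docs/doc/id"/>
      <field column="title" xpath="/docs/doc/title"/>
    </entity>
  </document>
</dataConfig>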



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Appending fields to pre-existed document

2017-10-13 Thread alessandro.benedetti
Hi,
"And all what we got 
only a overwriting doc came first by new one. Ok just put overwrite=false 
to params, and dublicating docs appeare."

What exactly is the doc you get?
Are the fields that were originally in the first doc (before the atomic update)
stored?
This is what you need to use:

https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html

If you don't, Solr by default will just overwrite the entire document.
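
As a sketch of what an atomic update request looks like (the core name, id and
field names below are placeholders; the fields to be preserved must be stored
or have docValues):

curl 'http://localhost:8983/solr/mycore/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1",
        "author_s":  {"set": "new value"},
        "source_ss": {"add": "database"}}]'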




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer

Amrit, this is markserv, and I just used wget to prove you are
correct, there is no Content-Type header.

Thanks for the help!  I'll see if I can hack markserv to add that, and
try again.

Kevin


Re: Several critical vulnerabilities discovered in Apache Solr (XXE & RCE)

2017-10-13 Thread Rick Leir
Hi all,
What is the earliest version which was vulnerable?
Thanks -- Rick
-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
OK, so I hacked markserv to add Content-Type text/html, but now I get

SimplePostTool: WARNING: Skipping URL with unsupported type text/html

What is it expecting?

$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP 
result status of 415
0 web pages indexed.
COMMITting Solr index changes to 
http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:03.882
$ 

Thanks.

Kevin


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Strange,

Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
Content-Type. Let's see what it says now.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2



Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Strange,
>> 
>> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> Content-Type. Let's see what it says now.

Same thing.  Verified Content-Type:

quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep 
Content-Type
  Content-Type: text/html;charset=utf-8
quadra[git:master]$ ]

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP 
result status of 415
0 web pages indexed.
COMMITting Solr index changes to 
http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:00.531
quadra[git:master]$ 

Kevin



Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Reference to the code:

.

String rawContentType = conn.getContentType();
String type = rawContentType.split(";")[0];
if(typeSupported(type) || "*".equals(fileTypes)) {
  String encoding = conn.getContentEncoding();

.

protected boolean typeSupported(String type) {
  for(String key : mimeMap.keySet()) {
if(mimeMap.get(key).equals(type)) {
  if(fileTypes.contains(key))
return true;
}
  }
  return false;
}

.

It has another check for fileTypes; I can see the page you are indexing ends
with .md and not .html. Let's hope this is not the issue.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

Just put "html" too and give it a shot. These are the types it is expecting:

mimeMap = new HashMap<>();
mimeMap.put("xml", "application/xml");
mimeMap.put("csv", "text/csv");
mimeMap.put("json", "application/json");
mimeMap.put("jsonl", "application/json");
mimeMap.put("pdf", "application/pdf");
mimeMap.put("rtf", "text/rtf");
mimeMap.put("html", "text/html");
mimeMap.put("htm", "text/html");
mimeMap.put("doc", "application/msword");
mimeMap.put("docx",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document");
mimeMap.put("ppt", "application/vnd.ms-powerpoint");
mimeMap.put("pptx",
"application/vnd.openxmlformats-officedocument.presentationml.presentation");
mimeMap.put("xls", "application/vnd.ms-excel");
mimeMap.put("xlsx",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("txt", "text/plain");
mimeMap.put("log", "text/plain");

The keys are the types supported.
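
Concretely, that suggestion amounts to adding "html" to the -filetypes list,
e.g. (same command as before, run directly or via the equivalent docker exec
invocation):

bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md,html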


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Ah!

The only supported type is: text/html; encoding=utf-8

I am not confident of this either :) but this should work.

See the code-snippet below:

..

if(res.httpStatus == 200) {
  // Raw content type of form "text/html; encoding=utf-8"
  String rawContentType = conn.getContentType();
  String type = rawContentType.split(";")[0];
  if(typeSupported(type) || "*".equals(fileTypes)) {
String encoding = conn.getContentEncoding();




Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2



Re: Appending fields to pre-existed document

2017-10-13 Thread Игорь Абрашин
Hi,
Yeah, sure, but what exactly should I utilize?
Because as I see it, all of them require using set or add in JSON; how can I
perform that from dataimport or when posting a file via curl? We have also
tried to use the version feature to combine both sources, but with no success.




Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Reference to the code:
>> 
>> .
>> 
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if(typeSupported(type) || "*".equals(fileTypes)) {
>>   String encoding = conn.getContentEncoding();
>> 
>> .
>> 
>> protected boolean typeSupported(String type) {
>>   for(String key : mimeMap.keySet()) {
>> if(mimeMap.get(key).equals(type)) {
>>   if(fileTypes.contains(key))
>> return true;
>> }
>>   }
>>   return false;
>> }
>> 
>> .
>> 
>> It has another check for fileTypes, I can see the page ending with .md
>> (which you are indexing) and not .html. Let's hope now this is not the
>> issue.

Did you see the "-filetypes md" at the end of the post command line?
Shouldn't that handle it?

Kevin


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> Just put "html" too and give it a shot. These are the types it is expecting:

Same thing.

>> 
>> mimeMap = new HashMap<>();
>> mimeMap.put("xml", "application/xml");
>> mimeMap.put("csv", "text/csv");
>> mimeMap.put("json", "application/json");
>> mimeMap.put("jsonl", "application/json");
>> mimeMap.put("pdf", "application/pdf");
>> mimeMap.put("rtf", "text/rtf");
>> mimeMap.put("html", "text/html");
>> mimeMap.put("htm", "text/html");
>> mimeMap.put("doc", "application/msword");
>> mimeMap.put("docx",
>> "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> mimeMap.put("pptx",
>> "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> mimeMap.put("xls", "application/vnd.ms-excel");
>> mimeMap.put("xlsx",
>> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("txt", "text/plain");
>> mimeMap.put("log", "text/plain");
>> 
>> The keys are the types supported.
>> 
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
>> wrote:
>> 
>> > Ah!
>> >
>> > Only supported type is: text/html; encoding=utf-8
>> >
>> > I am not confident of this either :) but this should work.
>> >
>> > See the code-snippet below:
>> >
>> > ..
>> >
>> > if(res.httpStatus == 200) {
>> >   // Raw content type of form "text/html; encoding=utf-8"
>> >   String rawContentType = conn.getContentType();
>> >   String type = rawContentType.split(";")[0];
>> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> > String encoding = conn.getContentEncoding();
>> >
>> > 
>> >
>> >
>> > Amrit Sarkar
>> > Search Engineer
>> > Lucidworks, Inc.
>> > 415-589-9269
>> > www.lucidworks.com
>> > Twitter http://twitter.com/lucidworks
>> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >
>> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>> >
>> >> Amrit Sarkar wrote:
>> >>
>> >> >> Strange,
>> >> >>
>> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >> >> Content-Type. Let's see what it says now.
>> >>
>> >> Same thing.  Verified Content-Type:
>> >>
>> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |&
>> >> grep Content-Type
>> >>   Content-Type: text/html;charset=utf-8
>> >> quadra[git:master]$ ]
>> >>
>> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
>> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> /docker-java-home/jre/bin/java -classpath 
>> >> /opt/solr/dist/solr-core-7.0.1.jar
>> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
>> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >> SimplePostTool version 5.0.0
>> >> Posting web pages to Solr url http://localhost:8983/solr/han
>> >> dbook/update/extract
>> >> Entering auto mode. Indexing pages with content-types corresponding to
>> >> file endings md
>> >> SimplePostTool: WARNING: Never crawl an external web site faster than
>> >> every 10 seconds, your IP will probably be blocked
>> >> Entering recursive mode, depth=10, delay=0s
>> >> Entering crawl at level 0 (1 links total, 1 new)
>> >> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
>> >> HTTP result status of 415
>> >> 0 web pages indexed.
>> >> COMMITting Solr index changes to http://localhost:8983/solr/han
>> >> dbook/update/extract...
>> >> Time spent: 0:00:00.531
>> >> quadra[git:master]$
>> >>
>> >> Kevin
>> >>
>> >> >>
>> >> >> Amrit Sarkar
>> >> >> Search Engineer
>> >> >> Lucidworks, Inc.
>> >> >> 415-589-9269
>> >> >> www.lucidworks.com
>> >> >> Twitter http://twitter.com/lucidworks
>> >> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >> >>
>> >> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer  wrote:
>> >> >>
>> >> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >> >> >
>> >> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> >> >
>> >> >> > What is it expecting?
>> >> >> >
>> >> >> > $ docker exec -it --user=solr solr bin/post -c handbook
>> >> >> > http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> >> > /docker-java-home/jre/bin/java -classpath
>> >> /opt/solr/dist/solr-core-7.0.

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Hi Kevin,

Can you post the Solr log in the mail thread? At first glance at the code, I
don't think it handles the .md by itself.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> Kevin,
> >>
> >> Just put "html" too and give it a shot. These are the types it is
> expecting:
>
> Same thing.
>
> >>
> >> mimeMap = new HashMap<>();
> >> mimeMap.put("xml", "application/xml");
> >> mimeMap.put("csv", "text/csv");
> >> mimeMap.put("json", "application/json");
> >> mimeMap.put("jsonl", "application/json");
> >> mimeMap.put("pdf", "application/pdf");
> >> mimeMap.put("rtf", "text/rtf");
> >> mimeMap.put("html", "text/html");
> >> mimeMap.put("htm", "text/html");
> >> mimeMap.put("doc", "application/msword");
> >> mimeMap.put("docx",
> >> "application/vnd.openxmlformats-officedocument.
> wordprocessingml.document");
> >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> >> mimeMap.put("pptx",
> >> "application/vnd.openxmlformats-officedocument.
> presentationml.presentation");
> >> mimeMap.put("xls", "application/vnd.ms-excel");
> >> mimeMap.put("xlsx",
> >> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> >> mimeMap.put("txt", "text/plain");
> >> mimeMap.put("log", "text/plain");
> >>
> >> The keys are the types supported.
> >>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
> >> wrote:
> >>
> >> > Ah!
> >> >
> >> > Only supported type is: text/html; encoding=utf-8
> >> >
> >> > I am not confident of this either :) but this should work.
> >> >
> >> > See the code-snippet below:
> >> >
> >> > ..
> >> >
> >> > if(res.httpStatus == 200) {
> >> >   // Raw content type of form "text/html; encoding=utf-8"
> >> >   String rawContentType = conn.getContentType();
> >> >   String type = rawContentType.split(";")[0];
> >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
> >> > String encoding = conn.getContentEncoding();
> >> >
> >> > 
> >> >
> >> >
> >> > Amrit Sarkar
> >> > Search Engineer
> >> > Lucidworks, Inc.
> >> > 415-589-9269
> >> > www.lucidworks.com
> >> > Twitter http://twitter.com/lucidworks
> >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >> >
> >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
> >> >
> >> >> Amrit Sarkar wrote:
> >> >>
> >> >> >> Strange,
> >> >> >>
> >> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
> page's
> >> >> >> Content-Type. Let's see what it says now.
> >> >>
> >> >> Same thing.  Verified Content-Type:
> >> >>
> >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md
> |&
> >> >> grep Content-Type
> >> >>   Content-Type: text/html;charset=utf-8
> >> >> quadra[git:master]$ ]
> >> >>
> >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
> handbook
> >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> >> /docker-java-home/jre/bin/java -classpath
> /opt/solr/dist/solr-core-7.0.1.jar
> >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
> -Ddata=web
> >> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> >> SimplePostTool version 5.0.0
> >> >> Posting web pages to Solr url http://localhost:8983/solr/han
> >> >> dbook/update/extract
> >> >> Entering auto mode. Indexing pages with content-types corresponding
> to
> >> >> file endings md
> >> >> SimplePostTool: WARNING: Never crawl an external web site faster than
> >> >> every 10 seconds, your IP will probably be blocked
> >> >> Entering recursive mode, depth=10, delay=0s
> >> >> Entering crawl at level 0 (1 links total, 1 new)
> >> >> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >> SimplePostTool: WARNING: The URL http://quadra:9091/index.md
> returned a
> >> >> HTTP result status of 415
> >> >> 0 web pages indexed.
> >> >> COMMITting Solr index changes to http://localhost:8983/solr/han
> >> >> dbook/update/extract...
> >> >> Time spent: 0:00:00.531
> >> >> quadra[git:master]$
> >> >>
> >> >> Kevin
> >> >>
> >> >> >>
> >> >> >> Amrit Sarkar
> >> >> >> Search Engineer
> >> >> >> Lucidworks, Inc.
> >> >> >> 415-589-9269
> >> >> >> www.lucidworks.com
> >> >> >> Twitter http://twitter.com/lucid

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Hi Kevin,
>> 
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.

How do I extract the log you want?


>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> Kevin,
>> > >>
>> > >> Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > Same thing.
>> >
>> > >>
>> > >> mimeMap = new HashMap<>();
>> > >> mimeMap.put("xml", "application/xml");
>> > >> mimeMap.put("csv", "text/csv");
>> > >> mimeMap.put("json", "application/json");
>> > >> mimeMap.put("jsonl", "application/json");
>> > >> mimeMap.put("pdf", "application/pdf");
>> > >> mimeMap.put("rtf", "text/rtf");
>> > >> mimeMap.put("html", "text/html");
>> > >> mimeMap.put("htm", "text/html");
>> > >> mimeMap.put("doc", "application/msword");
>> > >> mimeMap.put("docx",
>> > >> "application/vnd.openxmlformats-officedocument.
>> > wordprocessingml.document");
>> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > >> mimeMap.put("pptx",
>> > >> "application/vnd.openxmlformats-officedocument.
>> > presentationml.presentation");
>> > >> mimeMap.put("xls", "application/vnd.ms-excel");
>> > >> mimeMap.put("xlsx",
>> > >> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("txt", "text/plain");
>> > >> mimeMap.put("log", "text/plain");
>> > >>
>> > >> The keys are the types supported.
>> > >>
>> > >>
>> > >> Amrit Sarkar
>> > >> Search Engineer
>> > >> Lucidworks, Inc.
>> > >> 415-589-9269
>> > >> www.lucidworks.com
>> > >> Twitter http://twitter.com/lucidworks
>> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >>
>> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
>> > >> wrote:
>> > >>
>> > >> > Ah!
>> > >> >
>> > >> > Only supported type is: text/html; encoding=utf-8
>> > >> >
>> > >> > I am not confident of this either :) but this should work.
>> > >> >
>> > >> > See the code-snippet below:
>> > >> >
>> > >> > ..
>> > >> >
>> > >> > if(res.httpStatus == 200) {
>> > >> >   // Raw content type of form "text/html; encoding=utf-8"
>> > >> >   String rawContentType = conn.getContentType();
>> > >> >   String type = rawContentType.split(";")[0];
>> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> > >> > String encoding = conn.getContentEncoding();
>> > >> >
>> > >> > 
>> > >> >
>> > >> >
>> > >> > Amrit Sarkar
>> > >> > Search Engineer
>> > >> > Lucidworks, Inc.
>> > >> > 415-589-9269
>> > >> > www.lucidworks.com
>> > >> > Twitter http://twitter.com/lucidworks
>> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >> >
>> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>> > >> >
>> > >> >> Amrit Sarkar wrote:
>> > >> >>
>> > >> >> >> Strange,
>> > >> >> >>
>> > >> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
>> > page's
>> > >> >> >> Content-Type. Let's see what it says now.
>> > >> >>
>> > >> >> Same thing.  Verified Content-Type:
>> > >> >>
>> > >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md
>> > |&
>> > >> >> grep Content-Type
>> > >> >>   Content-Type: text/html;charset=utf-8
>> > >> >> quadra[git:master]$ ]
>> > >> >>
>> > >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
>> > handbook
>> > >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> > >> >> /docker-java-home/jre/bin/java -classpath
>> > /opt/solr/dist/solr-core-7.0.1.jar
>> > >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
>> > -Ddata=web
>> > >> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> > >> >> SimplePostTool version 5.0.0
>> > >> >> Posting web pages to Solr url http://localhost:8983/solr/han
>> > >> >> dbook/update/extract
>> > >> >> Entering auto mode. Indexing pages with content-types corresponding
>> > to
>> > >> >> file endings md
>> > >> >> SimplePostTool: WARNING: Never crawl an external web site faster than
>> > >> >> every 10 seconds, your IP will probably be blocked
>> > >> >> Entering recursive mode, depth=10, delay=0s
>> > >> >> Entering crawl at level 0 (1 links total, 1 new)
>> > >> >> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> > >> >> SimplePostTool: WARNING: The URL http://quadra:9091/index

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Hi Kevin,
>> 
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.

Note that when I use the admin web interface, and click on "Logging"
on the left, I just see a spinner that implies it's trying to retrieve
the logs (I see headers "Time (Local)   Level   CoreLogger  Message"),
but no log entries.  It's been like this for 10 minutes.

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> Kevin,
>> > >>
>> > >> Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > Same thing.
>> >
>> > >>
>> > >> mimeMap = new HashMap<>();
>> > >> mimeMap.put("xml", "application/xml");
>> > >> mimeMap.put("csv", "text/csv");
>> > >> mimeMap.put("json", "application/json");
>> > >> mimeMap.put("jsonl", "application/json");
>> > >> mimeMap.put("pdf", "application/pdf");
>> > >> mimeMap.put("rtf", "text/rtf");
>> > >> mimeMap.put("html", "text/html");
>> > >> mimeMap.put("htm", "text/html");
>> > >> mimeMap.put("doc", "application/msword");
>> > >> mimeMap.put("docx",
>> > >> "application/vnd.openxmlformats-officedocument.
>> > wordprocessingml.document");
>> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > >> mimeMap.put("pptx",
>> > >> "application/vnd.openxmlformats-officedocument.
>> > presentationml.presentation");
>> > >> mimeMap.put("xls", "application/vnd.ms-excel");
>> > >> mimeMap.put("xlsx",
>> > >> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("txt", "text/plain");
>> > >> mimeMap.put("log", "text/plain");
>> > >>
>> > >> The keys are the types supported.
>> > >>
>> > >>
>> > >> Amrit Sarkar
>> > >> Search Engineer
>> > >> Lucidworks, Inc.
>> > >> 415-589-9269
>> > >> www.lucidworks.com
>> > >> Twitter http://twitter.com/lucidworks
>> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >>
>> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
>> > >> wrote:
>> > >>
>> > >> > Ah!
>> > >> >
>> > >> > Only supported type is: text/html; encoding=utf-8
>> > >> >
>> > >> > I am not confident of this either :) but this should work.
>> > >> >
>> > >> > See the code-snippet below:
>> > >> >
>> > >> > ..
>> > >> >
>> > >> > if(res.httpStatus == 200) {
>> > >> >   // Raw content type of form "text/html; encoding=utf-8"
>> > >> >   String rawContentType = conn.getContentType();
>> > >> >   String type = rawContentType.split(";")[0];
>> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> > >> > String encoding = conn.getContentEncoding();
>> > >> >
>> > >> > 
>> > >> >
>> > >> >
>> > >> > Amrit Sarkar
>> > >> > Search Engineer
>> > >> > Lucidworks, Inc.
>> > >> > 415-589-9269
>> > >> > www.lucidworks.com
>> > >> > Twitter http://twitter.com/lucidworks
>> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >> >
>> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>> > >> >
>> > >> >> Amrit Sarkar wrote:
>> > >> >>
>> > >> >> >> Strange,
>> > >> >> >>
>> > >> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
>> > page's
>> > >> >> >> Content-Type. Let's see what it says now.
>> > >> >>
>> > >> >> Same thing.  Verified Content-Type:
>> > >> >>
>> > >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md
>> > |&
>> > >> >> grep Content-Type
>> > >> >>   Content-Type: text/html;charset=utf-8
>> > >> >> quadra[git:master]$ ]
>> > >> >>
>> > >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
>> > handbook
>> > >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> > >> >> /docker-java-home/jre/bin/java -classpath
>> > /opt/solr/dist/solr-core-7.0.1.jar
>> > >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
>> > -Ddata=web
>> > >> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> > >> >> SimplePostTool version 5.0.0
>> > >> >> Posting web pages to Solr url http://localhost:8983/solr/han
>> > >> >> dbook/update/extract
>> > >> >> Entering auto mode. Indexing pages with content-types corresponding
>> > to
>> > >> >> file endings md
>> > >> >> SimplePostTool: WARNING: Never crawl an external web site faster than
>> > >> >> every 10 seconds, your IP will probably be blocked
>> > >> >> Entering recursiv

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Ah, Docker. The logs are placed under [solr-home]/server/log/solr/log on
the machine. I haven't played much with Docker, but is there any way you can
get that file from that location?
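
For what it's worth, a couple of ways to pull it out of the container (a
sketch, assuming the container is named "solr" and the stock image layout):

  docker cp solr:/opt/solr/server/logs/solr.log .
  # or just view it in place:
  docker exec solr cat /opt/solr/server/logs/solr.log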

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:08 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> Hi Kevin,
> >>
> >> Can you post the solr log in the mail thread. I don't think it handled
> the
> >> .md by itself by first glance at code.
>
> How do I extract the log you want?
>
>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
> >>
> >> > Amrit Sarkar wrote:
> >> >
> >> > >> Kevin,
> >> > >>
> >> > >> Just put "html" too and give it a shot. These are the types it is
> >> > expecting:
> >> >
> >> > Same thing.
> >> >
> >> > >>
> >> > >> mimeMap = new HashMap<>();
> >> > >> mimeMap.put("xml", "application/xml");
> >> > >> mimeMap.put("csv", "text/csv");
> >> > >> mimeMap.put("json", "application/json");
> >> > >> mimeMap.put("jsonl", "application/json");
> >> > >> mimeMap.put("pdf", "application/pdf");
> >> > >> mimeMap.put("rtf", "text/rtf");
> >> > >> mimeMap.put("html", "text/html");
> >> > >> mimeMap.put("htm", "text/html");
> >> > >> mimeMap.put("doc", "application/msword");
> >> > >> mimeMap.put("docx",
> >> > >> "application/vnd.openxmlformats-officedocument.
> >> > wordprocessingml.document");
> >> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> >> > >> mimeMap.put("pptx",
> >> > >> "application/vnd.openxmlformats-officedocument.
> >> > presentationml.presentation");
> >> > >> mimeMap.put("xls", "application/vnd.ms-excel");
> >> > >> mimeMap.put("xlsx",
> >> > >> "application/vnd.openxmlformats-officedocument.
> spreadsheetml.sheet");
> >> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> >> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> >> > >> mimeMap.put("odp", "application/vnd.oasis.
> opendocument.presentation");
> >> > >> mimeMap.put("otp", "application/vnd.oasis.
> opendocument.presentation");
> >> > >> mimeMap.put("ods", "application/vnd.oasis.
> opendocument.spreadsheet");
> >> > >> mimeMap.put("ots", "application/vnd.oasis.
> opendocument.spreadsheet");
> >> > >> mimeMap.put("txt", "text/plain");
> >> > >> mimeMap.put("log", "text/plain");
> >> > >>
> >> > >> The keys are the types supported.
> >> > >>
> >> > >>
> >> > >> Amrit Sarkar
> >> > >> Search Engineer
> >> > >> Lucidworks, Inc.
> >> > >> 415-589-9269
> >> > >> www.lucidworks.com
> >> > >> Twitter http://twitter.com/lucidworks
> >> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >> > >>
> >> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <
> sarkaramr...@gmail.com>
> >> > >> wrote:
> >> > >>
> >> > >> > Ah!
> >> > >> >
> >> > >> > Only supported type is: text/html; encoding=utf-8
> >> > >> >
> >> > >> > I am not confident of this either :) but this should work.
> >> > >> >
> >> > >> > See the code-snippet below:
> >> > >> >
> >> > >> > ..
> >> > >> >
> >> > >> > if(res.httpStatus == 200) {
> >> > >> >   // Raw content type of form "text/html; encoding=utf-8"
> >> > >> >   String rawContentType = conn.getContentType();
> >> > >> >   String type = rawContentType.split(";")[0];
> >> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
> >> > >> > String encoding = conn.getContentEncoding();
> >> > >> >
> >> > >> > 
> >> > >> >
> >> > >> >
> >> > >> > Amrit Sarkar
> >> > >> > Search Engineer
> >> > >> > Lucidworks, Inc.
> >> > >> > 415-589-9269
> >> > >> > www.lucidworks.com
> >> > >> > Twitter http://twitter.com/lucidworks
> >> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >> > >> >
> >> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer 
> wrote:
> >> > >> >
> >> > >> >> Amrit Sarkar wrote:
> >> > >> >>
> >> > >> >> >> Strange,
> >> > >> >> >>
> >> > >> >> >> Can you add: "text/html;charset=utf-8". This is
> wiki.apache.org
> >> > page's
> >> > >> >> >> Content-Type. Let's see what it says now.
> >> > >> >>
> >> > >> >> Same thing.  Verified Content-Type:
> >> > >> >>
> >> > >> >> quadra[git:master]$ wget -S -O /dev/null
> http://quadra:9091/index.md
> >> > |&
> >> > >> >> grep Content-Type
> >> > >> >>   Content-Type: text/html;charset=utf-8
> >> > >> >> quadra[git:master]$ ]
> >> > >> >>
> >> > >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
> >> > handbook
> >> > >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes
> md
> >> > >> >> /docker-java-home/jre/bin/java -classpath
> >> > /opt/solr/dist/solr-core-7.0.1.jar
> >> > >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
> >> > -Ddata=web
> >> > >> >> org.apache.solr.util.SimplePostTool

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
pardon: [solr-home]/server/logs/solr.log

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:10 PM, Amrit Sarkar 
wrote:

> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
> the machine. I haven't played much with docker, any way you can get that
> file from that location.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Fri, Oct 13, 2017 at 8:08 PM, Kevin Layer  wrote:
>
>> Amrit Sarkar wrote:
>>
>> >> Hi Kevin,
>> >>
>> >> Can you post the solr log in the mail thread. I don't think it handled
>> the
>> >> .md by itself by first glance at code.
>>
>> How do I extract the log you want?
>>
>>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
>> >>
>> >> > Amrit Sarkar wrote:
>> >> >
>> >> > >> Kevin,
>> >> > >>
>> >> > >> Just put "html" too and give it a shot. These are the types it is
>> >> > expecting:
>> >> >
>> >> > Same thing.
>> >> >
>> >> > >>
>> >> > >> mimeMap = new HashMap<>();
>> >> > >> mimeMap.put("xml", "application/xml");
>> >> > >> mimeMap.put("csv", "text/csv");
>> >> > >> mimeMap.put("json", "application/json");
>> >> > >> mimeMap.put("jsonl", "application/json");
>> >> > >> mimeMap.put("pdf", "application/pdf");
>> >> > >> mimeMap.put("rtf", "text/rtf");
>> >> > >> mimeMap.put("html", "text/html");
>> >> > >> mimeMap.put("htm", "text/html");
>> >> > >> mimeMap.put("doc", "application/msword");
>> >> > >> mimeMap.put("docx",
>> >> > >> "application/vnd.openxmlformats-officedocument.
>> >> > wordprocessingml.document");
>> >> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> >> > >> mimeMap.put("pptx",
>> >> > >> "application/vnd.openxmlformats-officedocument.
>> >> > presentationml.presentation");
>> >> > >> mimeMap.put("xls", "application/vnd.ms-excel");
>> >> > >> mimeMap.put("xlsx",
>> >> > >> "application/vnd.openxmlformats-officedocument.spreadsheetml
>> .sheet");
>> >> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> >> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> >> > >> mimeMap.put("odp", "application/vnd.oasis.opendoc
>> ument.presentation");
>> >> > >> mimeMap.put("otp", "application/vnd.oasis.opendoc
>> ument.presentation");
>> >> > >> mimeMap.put("ods", "application/vnd.oasis.opendoc
>> ument.spreadsheet");
>> >> > >> mimeMap.put("ots", "application/vnd.oasis.opendoc
>> ument.spreadsheet");
>> >> > >> mimeMap.put("txt", "text/plain");
>> >> > >> mimeMap.put("log", "text/plain");
>> >> > >>
>> >> > >> The keys are the types supported.
>> >> > >>
>> >> > >>
>> >> > >> Amrit Sarkar
>> >> > >> Search Engineer
>> >> > >> Lucidworks, Inc.
>> >> > >> 415-589-9269
>> >> > >> www.lucidworks.com
>> >> > >> Twitter http://twitter.com/lucidworks
>> >> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >> > >>
>> >> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <
>> sarkaramr...@gmail.com>
>> >> > >> wrote:
>> >> > >>
>> >> > >> > Ah!
>> >> > >> >
>> >> > >> > Only supported type is: text/html; encoding=utf-8
>> >> > >> >
>> >> > >> > I am not confident of this either :) but this should work.
>> >> > >> >
>> >> > >> > See the code-snippet below:
>> >> > >> >
>> >> > >> > ..
>> >> > >> >
>> >> > >> > if(res.httpStatus == 200) {
>> >> > >> >   // Raw content type of form "text/html; encoding=utf-8"
>> >> > >> >   String rawContentType = conn.getContentType();
>> >> > >> >   String type = rawContentType.split(";")[0];
>> >> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> >> > >> > String encoding = conn.getContentEncoding();
>> >> > >> >
>> >> > >> > 
>> >> > >> >
>> >> > >> >
>> >> > >> > Amrit Sarkar
>> >> > >> > Search Engineer
>> >> > >> > Lucidworks, Inc.
>> >> > >> > 415-589-9269
>> >> > >> > www.lucidworks.com
>> >> > >> > Twitter http://twitter.com/lucidworks
>> >> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >> > >> >
>> >> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer 
>> wrote:
>> >> > >> >
>> >> > >> >> Amrit Sarkar wrote:
>> >> > >> >>
>> >> > >> >> >> Strange,
>> >> > >> >> >>
>> >> > >> >> >> Can you add: "text/html;charset=utf-8". This is
>> wiki.apache.org
>> >> > page's
>> >> > >> >> >> Content-Type. Let's see what it says now.
>> >> > >> >>
>> >> > >> >> Same thing.  Verified Content-Type:
>> >> > >> >>
>> >> > >> >> quadra[git:master]$ wget -S -O /dev/null
>> http://quadra:9091/index.md
>> >> > |&
>> >> > >> >> grep Content-Type
>> >> > >> >>   Content-Type: text/html;charset=utf-8
>> >> > >> >> quadra[git:master]$ ]
>> >> > >

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> the machine. I haven't played much with docker, any way you can get that
>> file from that location.

I see these files:

/opt/solr/server/logs/archived
/opt/solr/server/logs/solr_gc.log.0.current
/opt/solr/server/logs/solr.log
/opt/solr/server/solr/handbook/data/tlog

The 3rd one has very little info.  Attached:

2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server jetty-9.3.14.v20161028
2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___  
_   Welcome to Apache Solr™ version 7.0.1
2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __| 
___| |_ _   Starting in standalone mode on port 8983
2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__ \/ _ 
\ | '_|  Install dir: /opt/solr, Default config dir: 
/opt/solr/server/solr/configsets/_default/conf
2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter 
|___/\___/_|_|Start time: 2017-10-11T15:28:10.674Z
2017-10-11 15:28:10.747 INFO  (main) [   ] o.a.s.c.SolrResourceLoader Using 
system property solr.solr.home: /opt/solr/server/solr
2017-10-11 15:28:10.763 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading 
container configuration from /opt/solr/server/solr/solr.xml
2017-10-11 15:28:11.062 INFO  (main) [   ] o.a.s.c.SolrResourceLoader [null] 
Added 0 libs to classloader, from paths: []
2017-10-11 15:28:12.514 INFO  (main) [   ] o.a.s.c.CorePropertiesLocator Found 
0 core definitions underneath /opt/solr/server/solr
2017-10-11 15:28:12.635 INFO  (main) [   ] o.e.j.s.Server Started @4304ms
2017-10-11 15:29:00.971 INFO  (qtp1911006827-13) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/system params={wt=json} status=0 QTime=108
2017-10-11 15:29:01.080 INFO  (qtp1911006827-18) [   ] 
o.a.s.c.TransientSolrCoreCacheDefault Allocating transient cache for 2147483647 
transient cores
2017-10-11 15:29:01.083 INFO  (qtp1911006827-18) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/cores 
params={core=handbook&action=STATUS&wt=json} status=0 QTime=5
2017-10-11 15:29:01.194 INFO  (qtp1911006827-19) [   ] 
o.a.s.h.a.CoreAdminOperation core create command 
name=handbook&action=CREATE&instanceDir=handbook&wt=json
2017-10-11 15:29:01.342 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrResourceLoader [handbook] Added 51 libs to classloader, from paths: 
[/opt/solr/contrib/clustering/lib, /opt/solr/contrib/extraction/lib, 
/opt/solr/contrib/langid/lib, /opt/solr/contrib/velocity/lib, /opt/solr/dist]
2017-10-11 15:29:01.504 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrConfig Using Lucene MatchVersion: 7.0.1
2017-10-11 15:29:01.969 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.IndexSchema [handbook] Schema name=default-config
2017-10-11 15:29:03.678 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.IndexSchema Loaded schema default-config/1.6 with uniqueid field id
2017-10-11 15:29:03.806 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.CoreContainer Creating SolrCore 'handbook' using configuration from 
instancedir /opt/solr/server/solr/handbook, trusted=true
2017-10-11 15:29:03.853 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrCore solr.RecoveryStrategy.Builder
2017-10-11 15:29:03.866 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrCore [[handbook] ] Opening new SolrCore at 
[/opt/solr/server/solr/handbook], dataDir=[/opt/solr/server/solr/handbook/data/]
2017-10-11 15:29:04.180 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.r.XSLTResponseWriter xsltCacheLifetimeSeconds=5
2017-10-11 15:29:05.100 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.UpdateHandler Using UpdateLog implementation: 
org.apache.solr.update.UpdateLog
2017-10-11 15:29:05.101 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH 
numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
2017-10-11 15:29:05.150 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.CommitTracker Hard AutoCommit: if uncommited for 15000ms; 
2017-10-11 15:29:05.151 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.CommitTracker Soft AutoCommit: disabled
2017-10-11 15:29:05.199 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.SolrIndexSearcher Opening [Searcher@2b9fd97b[handbook] main]
2017-10-11 15:29:05.229 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.r.ManagedResourceStorage File-based storage initialized to use dir: 
/opt/solr/server/solr/handbook/conf
2017-10-11 15:29:05.266 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.h.c.SpellCheckComponent Initializing spell checkers
2017-10-11 15:29:05.283 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.DirectSolrSpellChecker init: 
{name=default,field=_text_,classname=solr.DirectSolrSpellChecker,distanceMeasure=internal,accuracy=0.5,maxEdits=2,minPrefix=1,maxInspections=5,minQueryLength=4,maxQueryFrequency=0.01}
2017-10-11 15:29:05.318 INFO  (qtp1911006827-19)

Re: is there a way to remove deleted documents from index without optimize

2017-10-13 Thread Harry Yoo
Thanks for the clarification. 

I use

<luceneMatchVersion>${lucene.version}</luceneMatchVersion>

in solrconfig.xml and pass -Dlucene.version when I launch Solr, to keep the
versions in sync.
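
As an illustration of that setup (a sketch; the start command and the version
value are only examples):

  # ${lucene.version} in solrconfig.xml resolves to whatever is passed here
  bin/solr start -Dlucene.version=6.6.1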



> On Oct 12, 2017, at 11:01 PM, Erick Erickson  wrote:
> 
> You can use the IndexUpgradeTool that ships with each version of Solr
> (well, actually Lucene) to, well, upgrade your index. So you can use
> the IndexUpgradeTool that ships with 5x to upgrade from 4x. And the
> one that ships with 6x to upgrade from 5x. etc.
> 
> That said, none of that is necessary _if_ you
>> have the Lucene version in solrconfig.xml be the one that corresponds to 
>> your current Solr. I.e. a solrconfig for 6x should have a luceneMatchVersion 
>> of 6something.
>> you update your index enough to rewrite all segments before moving to the 
>> _next_ version. When Lucene merges a segment, it writes the new segment
>> according to the luceneMatchVersion in solrconfig.xml. So as long as you are 
>> on a version long enough for all segments to be merged into new segments, 
>> you don't have to worry.
> 
> Best,
> Erick
> 
> On Thu, Oct 12, 2017 at 8:29 PM, Harry Yoo  wrote:
>> I should have read this. My project has been running from apache solr 4.x, 
>> and moved to 5.x and recently migrated to 6.6.1. Do you think solr will take 
>> care of old version indexes as well? I wanted to make sure my indexes are 
>> updated with the 6.x Lucene version so that it will be supported when I move to
>> solr 7.x
>> 
>> Is there any best practice managing solr indexes?
>> 
>> Harry
>> 
>>> On Sep 22, 2015, at 8:21 PM, Walter Underwood  wrote:
>>> 
>>> Don’t do anything. Solr will automatically clean up the deleted documents 
>>> for you.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Sep 22, 2015, at 6:01 PM, CrazyDiamond  wrote:
 
 My index is updating frequently and I need to remove unused documents from
 the index after update/reindex.
 Optimization is very expensive, so what should I do?
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/is-there-a-way-to-remove-deleted-documents-from-index-without-optimize-tp4230691.html
 Sent from the Solr - User mailing list archive at Nabble.com.
>>> 
>> 



Re: Concern on solr commit

2017-10-13 Thread Erick Erickson
Emir is spot on. Here's another thing though. You cannot see a
document until after the commit happens as you well know. Pay close
attention to this part of the error message:

"Error opening new searcher. exceeded limit of maxWarmingSearchers=2,
try again later."

The commit generating this error doesn't open a new searcher, so upon
return you can't necessarily search on the document anyway because
opening the searcher aborted. It's only because you're hammering Solr
with commits from everywhere else that you ever see documents
committed when this error is generated.

In fact, theoretically you could _never_ see the document. Consider
that the client that sends the commit that generates this error
> happens to be the last one that ever sends a doc
> gets this error.

In this case since the open searcher failed, the doc(s) will be
invisible until the next commit happens. So committing from the client
isn't guaranteeing what you think anyway.

bq: 100-500 commits can happen simultaneously from application and autocommit
This is totally an anti-pattern in Solr. Don't do it. Really.

bq: We have specific near realtime search requirements

I find it hard to believe that, say, a 1 second latency is
unacceptable. Set your soft commit interval in solrconfig.xml to 1 second,
stop firing commits from the client, and see if anyone notices; I bet
not. And as I mentioned above, it's only by luck that you can see them,
since the rejected openSearcher happening on commit means it's not
guaranteed that you can search on the doc anyway.
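
For the soft commit setting, solrconfig.xml would look roughly like this
(values are illustrative):

  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>  <!-- new docs become searchable within ~1 second -->
  </autoSoftCommit>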

So when you get into the discussion about "we have to be absolutely
positively certain that we can see the doc instantly after it's been
added" remember that you're not getting that currently. If it's some
absolute requirement that "after the indexing call returns we must be
able to search for the document", then stick in a 1-2 second delay
after sending the doc to Solr in the client.

And I suspect you're doing a "hard commit" from the client also, which
is more expensive than a soft commit, thus contributing to latency.
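
If the client really must drive visibility itself, commitWithin is the cheaper
way to do it (a sketch; collection name and document are illustrative):

  curl 'http://localhost:8983/solr/mycollection/update?commitWithin=1000' \
    -H 'Content-Type: application/json' \
    -d '[{"id":"doc1","title_t":"hello"}]'
  # instead of ...?commit=true, which forces an immediate hard commit per request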

Best,
Erick



On Fri, Oct 13, 2017 at 4:17 AM, Emir Arnautović
 wrote:
> Hi Leo,
> The errors that you are seeing are related to frequent commits - a new commit is
> issued before the searcher for the previous commit is opened and warmed.
>
> I haven’t looked at the indexing code in a while, but assuming it did not
> change, commits and writes are mutually exclusive - guarded by the same lock.
> So yes - frequent commits will result in longer indexing request latency. Also,
> each commit results in a new segment, and segments get merged - a small segment
> merges fast, but it is still overhead.
>
> I would suggest that you replace explicit commits with at least
> commitWithin=100 (or as much as possible) - that will result in at least
> 100ms worth of documents being grouped into a single commit.
>
> Also, if you are committing this frequently, think of turning off your caches 
> since they are invalidated on each commit.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 13 Oct 2017, at 13:06, Leo Prince  wrote:
>>
>> Hi Emir,
>>
>> Thanks for the response.
>>
>> We have specific near-realtime search requirements; that is why we are
>> explicitly invoking Solr commits. However, we will try to improve on
>> reducing the commits from the application. In the meantime, the errors/warnings
>> I mentioned in my previous mail - are they really due to frequent commits?
>>
>> We have plenty of requests taking several seconds to deliver from Solr,
>> irrespective of network overhead, so any thoughts on whether commit frequency
>> affects Solr latency?
>>
>> Thanks,
>> Leo Prince
>>
>> On Fri, Oct 13, 2017 at 2:46 PM, Emir Arnautović <
>> emir.arnauto...@sematext.com> wrote:
>>
>>> Hi Leo,
>>> It is considered a bad practice to commit from your application. You
>>> should let Solr handle commits. There is a great article about soft and
>>> hard commits: https://lucidworks.com/2013/08/23/understanding-
>>> transaction-logs-softcommit-and-commit-in-sorlcloud/ <
>>> https://lucidworks.com/2013/08/23/understanding-
>>> transaction-logs-softcommit-and-commit-in-sorlcloud/>
>>>
>>> If you really want to commit from your application, then you should use
>>> commitWithin parameter that would group your commits in a single commit,
>>> and you would be sure that your changes are committed within some time (you
>>> can do that with autocommit as well).
>>>
>>> Opening searcher can be fast but it can also last for a while - it depends
>>> on warming parameters that you set. In any case, I would not recommend that
>>> you focus on making opening last less, but to see what are your NRT
>>> requirements and commit as rare as possible.
>>>
>>> HTH,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

I am not able to replicate the issue on my system, which is a bit annoying
for me. Try this out one last time:

docker exec -it --user=solr solr bin/post -c handbook
http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html

and have Content-Type: "html" and "text/html", try with both.

If you get past this hurdle, let me know.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log
> in
> >> the machine. I haven't played much with docker, any way you can get that
> >> file from that location.
>
> I see these files:
>
> /opt/solr/server/logs/archived
> /opt/solr/server/logs/solr_gc.log.0.current
> /opt/solr/server/logs/solr.log
> /opt/solr/server/solr/handbook/data/tlog
>
> The 3rd one has very little info.  Attached:
>
>
> 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
> jetty-9.3.14.v20161028
> 2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> ___  _   Welcome to Apache Solr™ version 7.0.1
> 2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter /
> __| ___| |_ _   Starting in standalone mode on port 8983
> 2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__
> \/ _ \ | '_|  Install dir: /opt/solr, Default config dir:
> /opt/solr/server/solr/configsets/_default/conf
> 2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> |___/\___/_|_|Start time: 2017-10-11T15:28:10.674Z
> 2017-10-11 15:28:10.747 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
> Using system property solr.solr.home: /opt/solr/server/solr
> 2017-10-11 15:28:10.763 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading
> container configuration from /opt/solr/server/solr/solr.xml
> 2017-10-11 15:28:11.062 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
> [null] Added 0 libs to classloader, from paths: []
> 2017-10-11 15:28:12.514 INFO  (main) [   ] o.a.s.c.CorePropertiesLocator
> Found 0 core definitions underneath /opt/solr/server/solr
> 2017-10-11 15:28:12.635 INFO  (main) [   ] o.e.j.s.Server Started @4304ms
> 2017-10-11 15:29:00.971 INFO  (qtp1911006827-13) [   ]
> o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system
> params={wt=json} status=0 QTime=108
> 2017-10-11 15:29:01.080 INFO  (qtp1911006827-18) [   ] 
> o.a.s.c.TransientSolrCoreCacheDefault
> Allocating transient cache for 2147483647 transient cores
> 2017-10-11 15:29:01.083 INFO  (qtp1911006827-18) [   ]
> o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores
> params={core=handbook&action=STATUS&wt=json} status=0 QTime=5
> 2017-10-11 15:29:01.194 INFO  (qtp1911006827-19) [   ]
> o.a.s.h.a.CoreAdminOperation core create command
> name=handbook&action=CREATE&instanceDir=handbook&wt=json
> 2017-10-11 15:29:01.342 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrResourceLoader [handbook] Added 51 libs to classloader, from
> paths: [/opt/solr/contrib/clustering/lib, /opt/solr/contrib/extraction/lib,
> /opt/solr/contrib/langid/lib, /opt/solr/contrib/velocity/lib,
> /opt/solr/dist]
> 2017-10-11 15:29:01.504 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrConfig Using Lucene MatchVersion: 7.0.1
> 2017-10-11 15:29:01.969 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.s.IndexSchema [handbook] Schema name=default-config
> 2017-10-11 15:29:03.678 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.s.IndexSchema Loaded schema default-config/1.6 with uniqueid field id
> 2017-10-11 15:29:03.806 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.CoreContainer Creating SolrCore 'handbook' using configuration from
> instancedir /opt/solr/server/solr/handbook, trusted=true
> 2017-10-11 15:29:03.853 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrCore solr.RecoveryStrategy.Builder
> 2017-10-11 15:29:03.866 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrCore [[handbook] ] Opening new SolrCore at
> [/opt/solr/server/solr/handbook], dataDir=[/opt/solr/server/
> solr/handbook/data/]
> 2017-10-11 15:29:04.180 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.r.XSLTResponseWriter xsltCacheLifetimeSeconds=5
> 2017-10-11 15:29:05.100 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.UpdateHandler Using UpdateLog implementation:
> org.apache.solr.update.UpdateLog
> 2017-10-11 15:29:05.101 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH
> numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
> 2017-10-11 15:29:05.150 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.CommitTracker Hard AutoCommit: if uncommited for 15000ms;
> 2017-10-11 15:29:05.151 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.CommitTracker Soft AutoCommit: disabled
> 2017-10-11 15:29:05.199 INFO  (qtp1911006827-19) [   x:ha

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> I am not able to replicate the issue on my system, which is bit annoying
>> for me. Try this out for last time:
>> 
>> docker exec -it --user=solr solr bin/post -c handbook
>> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>> 
>> and have Content-Type: "html" and "text/html", try with both.

With text/html and your command I get

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings html
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: 
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not 
allowed in prolog.
at 
org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
at 
org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
at 
org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; 
Content is not allowed in prolog.
at 
com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at 
com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
at 
org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
... 5 more


When I use "-filetype md" back to the regular output that doesn't scan
anything.


>> 
>> If you get past this hurdle this hurdle, let me know.
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log
>> > in
>> > >> the machine. I haven't played much with docker, any way you can get that
>> > >> file from that location.
>> >
>> > I see these files:
>> >
>> > /opt/solr/server/logs/archived
>> > /opt/solr/server/logs/solr_gc.log.0.current
>> > /opt/solr/server/logs/solr.log
>> > /opt/solr/server/solr/handbook/data/tlog
>> >
>> > The 3rd one has very little info.  Attached:
>> >
>> >
>> > 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
>> > jetty-9.3.14.v20161028
>> > 2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
>> > ___  _   Welcome to Apache Solr™ version 7.0.1
>> > 2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter /
>> > __| ___| |_ _   Starting in standalone mode on port 8983
>> > 2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__
>> > \/ _ \ | '_|  Install dir: /opt/solr, Default config dir:
>> > /opt/solr/server/solr/configsets/_default/conf
>> > 2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
>> > |___/\___/_|_|Start time: 2017-10-11T15:28:10.674Z
>> > 2017-10-11 15:28:10.747 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
>> > Using system property solr.solr.home: /opt/solr/server/solr
>> > 2017-10-11 15:28:10.763 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading
>> > container configuration from /opt/solr/server/solr/solr.xml
>> > 2017-10-11 15:28:11.062 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
>> > [null] Added 0 libs to classloader, from paths: []
>> > 2017-10-11 15:28:12.514 INFO  (main) [   ] o.a.s.c.CorePropertiesLocator
>> > Found 0 core definitions underneath /opt/solr/server/solr
>> > 2017-10-11 15:28:12.635 INFO  (main) [   ] o.e.j.s.Server Started @4304ms
>> > 2017-10-11 15:29:00.971 INFO  (qtp1911006827-13) [   ]
>> > o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system
>> > params={wt=json} status=0 QTime=108
>> > 2017-10-11 15

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

fileType "md" is not a recognized format in SimplePostTool; anyway, moving
on.

The above is a SAXParseException wrapped in a runtime exception. Nothing can be
done at the Solr end except curating your own data.
Some helpful links:
https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae
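
A quick way to reproduce the parse failure outside Solr (assuming xmllint is
installed; the URL is the one from this thread):

  curl -s http://quadra:9091/index.md | xmllint --noout -
  # fails if the served page is not well-formed XML, which is what trips up
  # SimplePostTool's DOM-based link extraction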

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:48 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> Kevin,
> >>
> >> I am not able to replicate the issue on my system, which is bit annoying
> >> for me. Try this out for last time:
> >>
> >> docker exec -it --user=solr solr bin/post -c handbook
> >> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0
> -filetypes html
> >>
> >> and have Content-Type: "html" and "text/html", try with both.
>
> With text/html I get and your command I get
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes
> html
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook
> -Ddata=web org.apache.solr.util.SimplePostTool
> http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/
> handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to
> file endings html
> SimplePostTool: WARNING: Never crawl an external web site faster than
> every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Exception in thread "main" java.lang.RuntimeException:
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is
> not allowed in prolog.
> at org.apache.solr.util.SimplePostTool$PageFetcher.
> getLinksFromWebPage(SimplePostTool.java:1252)
> at org.apache.solr.util.SimplePostTool.webCrawl(
> SimplePostTool.java:616)
> at org.apache.solr.util.SimplePostTool.postWebPages(
> SimplePostTool.java:563)
> at org.apache.solr.util.SimplePostTool.doWebMode(
> SimplePostTool.java:365)
> at org.apache.solr.util.SimplePostTool.execute(
> SimplePostTool.java:187)
> at org.apache.solr.util.SimplePostTool.main(
> SimplePostTool.java:172)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1;
> Content is not allowed in prolog.
> at com.sun.org.apache.xerces.internal.parsers.DOMParser.
> parse(DOMParser.java:257)
> at com.sun.org.apache.xerces.internal.jaxp.
> DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
> at javax.xml.parsers.DocumentBuilder.parse(
> DocumentBuilder.java:121)
> at org.apache.solr.util.SimplePostTool.makeDom(
> SimplePostTool.java:1061)
> at org.apache.solr.util.SimplePostTool$PageFetcher.
> getLinksFromWebPage(SimplePostTool.java:1232)
> ... 5 more
>
>
> When I use "-filetype md" back to the regular output that doesn't scan
> anything.
>
>
> >>
> >> If you get past this hurdle, let me know.
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:
> >>
> >> > Amrit Sarkar wrote:
> >> >
> >> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/
> log
> >> > in
> >> > >> the machine. I haven't played much with docker, any way you can
> get that
> >> > >> file from that location.
> >> >
> >> > I see these files:
> >> >
> >> > /opt/solr/server/logs/archived
> >> > /opt/solr/server/logs/solr_gc.log.0.current
> >> > /opt/solr/server/logs/solr.log
> >> > /opt/solr/server/solr/handbook/data/tlog
> >> >
> >> > The 3rd one has very little info.  Attached:
> >> >
> >> >
> >> > 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
> >> > jetty-9.3.14.v20161028
> >> > 2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> >> > ___  _   Welcome to Apache Solr™ version 7.0.1
> >> > 2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> /
> >> > __| ___| |_ _   Starting in standalone mode on port 8983
> >> > 2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> \__
> >> > \/ _ \ | '_|  Install dir: /opt/solr, Default config dir:
> >> > /opt/solr/server/solr/configsets/_default/conf
> >> > 2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> >> > |__

Re: Parsing of rq queries in LTR

2017-10-13 Thread Michael Alcorn
I believe I've discovered a workaround. If you use:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!dismax qf=text_tfidf}${text}"
  }
}

instead of:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!field f=issue_tfidf}${case_description}"
  }
}

you can then use single quotes to incorporate multi-term arguments as
Alessandro suggested. I've added this information to the Jira.
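
For example, a rerank request could look like this (a sketch: the model name,
query text and reRankDocs value are made up, only the efi quoting is the
point):

  q=fiber channel issue
  &rq={!ltr model=my_ltr_model reRankDocs=100 efi.text='fiber channel issue'}
  &fl=id,score

With the single quotes the whole phrase is passed as the "text" external
feature value, so the {!dismax qf=text_tfidf}${text} feature sees all the
terms instead of only the first one.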

On Thu, Oct 12, 2017 at 9:10 AM, Michael Alcorn  wrote:

> It turns out my last comment on that Jira was mistaken. Multi-term EFI
> arguments still exhibit unexpected behavior. Binoy is trying to help me
> figure out what the issue is. I plan on updating the Jira once we've
> figured out the problem.
>
> On Thu, Oct 12, 2017 at 3:41 AM, alessandro.benedetti <
> a.benede...@sease.io> wrote:
>
>> I don't think this is actually that much related to LTR Solr Feature.
>> In the Solr feature I see you specify a query with a specific query parser
>> (field).
>> Unless there is a bug in the SolrFeature for LTR, I expect the query
>> parser
>> you defined to be used[1].
>>
>> This means :
>>
>> "rawquerystring":"{!field f=full_name}alessandro benedetti",
>> "querystring":"{!field f=full_name}alessandro benedetti",
>> "parsedquery":"PhraseQuery(full_name:\"alessandro benedetti\")",
>> "parsedquery_toString":"full_name:\"alessandro benedetti\"",
>>
>> In relation to multi term EFI, you need to pass
>> efi.example='term1 term2' .
>> If not just one term will be passed as EFI.[2]
>> This is more likely to be your problem.
>> I don't think the dash should be relevant at all
>>
>> [1]
>> https://lucene.apache.org/solr/guide/6_6/other-parsers.html#
>> OtherParsers-FieldQueryParser
>> [2] https://issues.apache.org/jira/browse/SOLR-11386
>>
>>
>>
>>
>> -
>> ---
>> Alessandro Benedetti
>> Search Consultant, R&D Software Engineer, Director
>> Sease Ltd. - www.sease.io
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
>
>


Re: Strange Behavior When Extracting Features

2017-10-13 Thread Michael Alcorn
I believe I've discovered a workaround. If you use:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!dismax qf=text_tfidf}${text}"
  }
}

instead of:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!field f=issue_tfidf}${case_description}"
  }
}

you can then use single quotes to incorporate multi-term arguments as
Alessandro suggested. I've added this information to the Jira.
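
For example, when extracting feature values in the response (a sketch; the
efi value is illustrative, the store name is the one defined above):

  fl=id,score,[features store=redhat_efi_feature_store efi.text='added couple of fiber channel']

Quoted like that, the full phrase reaches the {!dismax qf=text_tfidf}${text}
feature query; unquoted, only the first term would be substituted.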

On Fri, Sep 22, 2017 at 8:30 AM, alessandro.benedetti 
wrote:

> I think this has nothing to do with the LTR plugin.
> The problem here should be just the way you use the local params,
> to properly pass multi term local params in Solr you need to use *'* :
>
> efi.case_description='added couple of fiber channel'
>
> This should work.
> If not only the first term will be passed as a local param and then passed
> in the efi map to LTR.
>
> I will update the Jira issue as well.
>
> Cheers
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: book on solr

2017-10-13 Thread Deepak Vohra
Use Docker with Kubernetes, which can autoscale Docker containers based on
load. The official Docker image for Solr is https://hub.docker.com/_/solr/
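
For example (a rough sketch; "solr" here is an assumed Kubernetes deployment
name, and keep in mind Solr nodes are stateful, so autoscaling them needs
SolrCloud-aware orchestration rather than just adding container replicas):

  # run the official Solr image locally
  docker run -d -p 8983:8983 --name my_solr solr

  # let Kubernetes scale an existing "solr" deployment on CPU load
  kubectl autoscale deployment solr --min=2 --max=5 --cpu-percent=70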

On Thu, 10/12/17, Jay Potharaju  wrote:

 Subject: book on solr
 To: solr-user@lucene.apache.org
 Received: Thursday, October 12, 2017, 10:42 PM
 
 Hi,
 I am looking for a book that covers some basic principles on how to
 scale Solr. Are there any suggestions? For example, how to scale by
 adding shards or replicas in the case of high RPS and high index rates.
 
 Any blog or documentation that provides some basic rules or guidelines
 for scaling would also be great.
 
 Thanks
 Jay Potharaju
 


zero-day exploit security issue

2017-10-13 Thread Xie, Sean
Is there a tracking issue to address this for Solr 6.6.x and 7.x?

https://lucene.apache.org/solr/news.html#12-october-2017-please-secure-your-apache-solr-servers-since-a-zero-day-exploit-has-been-reported-on-a-public-mailing-list

Sean



Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> fileType => md is not recognizable format in SimplePostTool, anyway, moving
>> on.

OK, thanks.  Looks like I'll have to abandon using solr for this
project (or find another way to crawl the site).

Thank you for all the help, though.  I appreciate it.

>> The above is SAXParse, runtime exception. Nothing can be done at Solr end
>> except curating your own data.
>> Some helpful links:
>> https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
>> https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 8:48 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> Kevin,
>> > >>
>> > >> I am not able to replicate the issue on my system, which is bit annoying
>> > >> for me. Try this out for last time:
>> > >>
>> > >> docker exec -it --user=solr solr bin/post -c handbook
>> > >> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0
>> > -filetypes html
>> > >>
>> > >> and have Content-Type: "html" and "text/html", try with both.
>> >
>> > With text/html I get and your command I get
>> >
>> > quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
>> > http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes
>> > html
>> > /docker-java-home/jre/bin/java -classpath 
>> > /opt/solr/dist/solr-core-7.0.1.jar
>> > -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook
>> > -Ddata=web org.apache.solr.util.SimplePostTool
>> > http://quadra.franz.com:9091/index.md
>> > SimplePostTool version 5.0.0
>> > Posting web pages to Solr url http://localhost:8983/solr/
>> > handbook/update/extract
>> > Entering auto mode. Indexing pages with content-types corresponding to
>> > file endings html
>> > SimplePostTool: WARNING: Never crawl an external web site faster than
>> > every 10 seconds, your IP will probably be blocked
>> > Entering recursive mode, depth=10, delay=0s
>> > Entering crawl at level 0 (1 links total, 1 new)
>> > POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
>> > [Fatal Error] :1:1: Content is not allowed in prolog.
>> > Exception in thread "main" java.lang.RuntimeException:
>> > org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is
>> > not allowed in prolog.
>> > at org.apache.solr.util.SimplePostTool$PageFetcher.
>> > getLinksFromWebPage(SimplePostTool.java:1252)
>> > at org.apache.solr.util.SimplePostTool.webCrawl(
>> > SimplePostTool.java:616)
>> > at org.apache.solr.util.SimplePostTool.postWebPages(
>> > SimplePostTool.java:563)
>> > at org.apache.solr.util.SimplePostTool.doWebMode(
>> > SimplePostTool.java:365)
>> > at org.apache.solr.util.SimplePostTool.execute(
>> > SimplePostTool.java:187)
>> > at org.apache.solr.util.SimplePostTool.main(
>> > SimplePostTool.java:172)
>> > Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1;
>> > Content is not allowed in prolog.
>> > at com.sun.org.apache.xerces.internal.parsers.DOMParser.
>> > parse(DOMParser.java:257)
>> > at com.sun.org.apache.xerces.internal.jaxp.
>> > DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
>> > at javax.xml.parsers.DocumentBuilder.parse(
>> > DocumentBuilder.java:121)
>> > at org.apache.solr.util.SimplePostTool.makeDom(
>> > SimplePostTool.java:1061)
>> > at org.apache.solr.util.SimplePostTool$PageFetcher.
>> > getLinksFromWebPage(SimplePostTool.java:1232)
>> > ... 5 more
>> >
>> >
>> > When I use "-filetype md" back to the regular output that doesn't scan
>> > anything.
>> >
>> >
>> > >>
>> > >> If you get past this hurdle, let me know.
>> > >>
>> > >> Amrit Sarkar
>> > >> Search Engineer
>> > >> Lucidworks, Inc.
>> > >> 415-589-9269
>> > >> www.lucidworks.com
>> > >> Twitter http://twitter.com/lucidworks
>> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >>
>> > >> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:
>> > >>
>> > >> > Amrit Sarkar wrote:
>> > >> >
>> > >> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/
>> > log
>> > >> > in
>> > >> > >> the machine. I haven't played much with docker, any way you can
>> > get that
>> > >> > >> file from that location.
>> > >> >
>> > >> > I see these files:
>> > >> >
>> > >> > /opt/solr/server/logs/archived
>> > >> > /opt/solr/server/logs/solr_gc.log.0.current
>> > >> > /opt/solr/server/logs/solr.log
>> > >> > /opt/solr/server/solr/handbook/data/tlog
>> > >> >
>> > >> > The 3rd one has very little info.  Attached:
>> > >> >
>> > >> >
>> > >> > 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
>> > >> > jetty-9.3.14.v20161028
>> > >> > 201

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Rick Leir

On 2017-10-13 04:19 PM, Kevin Layer wrote:

Amrit Sarkar wrote:


Kevin,

fileType => md is not recognizable format in SimplePostTool, anyway, moving
on.

OK, thanks.  Looks like I'll have to abandon using solr for this
project (or find another way to crawl the site).

Thank you for all the help, though.  I appreciate it.

Ha, these messages crash my Android mail client!  Now...

Did you try Nutch? Or the Norconex HTTP crawler? Tika? Or any Python 
crawler, posting its documents to the Solr API.
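
For example, a crawler can push whatever it fetches straight into the update
handler (a sketch; the "handbook" collection is from earlier in this thread,
the field name and document values are made up):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/handbook/update?commitWithin=10000' \
  -d '[{"id": "http://quadra.franz.com:9091/index.md", "content_txt": "page text here"}]'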

cheers -- Rick