solr 7.0.1: exception running post to crawl simple website

2017-10-11 Thread Kevin Layer
I want to use Solr to index a markdown website.  The files
are native markdown, but they are served as HTML (by markserv).

Here's what I did:

docker run --name solr -d -p 8983:8983 -t solr
docker exec -it --user=solr solr bin/solr create_core -c handbook

Then, to crawl the site:

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
Exception in thread "main" java.lang.NullPointerException
    at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
    at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
    at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
    at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
    at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
    at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
quadra[git:master]$ 


Any ideas on what I did wrong?

Thanks.

Kevin


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> You are getting NPE at:
>> 
>> String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL
>> 
>> // related code
>> 
>> String rawContentType = conn.getContentType();
>> 
>> public String getContentType() {
>> return getHeaderField("content-type");
>> }
>> 
>> HttpURLConnection conn = (HttpURLConnection) u.openConnection();
>> 
>> Can you check whether your web page's headers are properly set and
>> include a "content-type" key?

Amrit, this is markserv, and I just used wget to confirm you are
correct: there is no Content-Type header.
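
For anyone hitting the same thing, here is a minimal sketch (my own
illustration, not Solr code; the class name is made up) of why a missing
header becomes that NPE: URLConnection.getContentType() just returns
getHeaderField("content-type"), so it comes back null when the server
omits the header, and the unguarded split() then throws.

import java.net.HttpURLConnection;
import java.net.URL;

public class MissingContentTypeDemo {
    public static void main(String[] args) throws Exception {
        // Any server that omits the Content-Type header will do here.
        URL u = new URL("http://quadra.franz.com:9091/index.md");
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        // getContentType() is just getHeaderField("content-type"), so it
        // returns null when the header is absent.
        String rawContentType = conn.getContentType();
        // SimplePostTool does the equivalent of the next line with no null
        // check, which is the NullPointerException at SimplePostTool.java:1138.
        String type = rawContentType.split(";")[0];
        System.out.println(type);
    }
}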

Thanks for the help!  I'll see if I can hack markserv to add that, and
try again.

Kevin


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
OK, so I hacked markserv to add Content-Type text/html, but now I get

SimplePostTool: WARNING: Skipping URL with unsupported type text/html

What is it expecting?

$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP 
result status of 415
0 web pages indexed.
COMMITting Solr index changes to 
http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:03.882
$ 

Thanks.

Kevin


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Strange,
>> 
>> Can you add "text/html;charset=utf-8"? This is the wiki.apache.org page's
>> Content-Type. Let's see what it says now.

Same thing.  Verified Content-Type:

quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep 
Content-Type
  Content-Type: text/html;charset=utf-8
quadra[git:master]$ ]

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP 
result status of 415
0 web pages indexed.
COMMITting Solr index changes to 
http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:00.531
quadra[git:master]$ 

Kevin

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Reference to the code:
>> 
>> .
>> 
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if(typeSupported(type) || "*".equals(fileTypes)) {
>>   String encoding = conn.getContentEncoding();
>> 
>> .
>> 
>> protected boolean typeSupported(String type) {
>>   for(String key : mimeMap.keySet()) {
>> if(mimeMap.get(key).equals(type)) {
>>   if(fileTypes.contains(key))
>> return true;
>> }
>>   }
>>   return false;
>> }
>> 
>> .
>> 
>> There is another check against fileTypes: I can see the page you are
>> indexing ends with .md and not .html. Let's hope this is not the
>> issue.

Did you see the "-filetypes md" at the end of the post command line?
Shouldn't that handle it?

Kevin

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar 
>> wrote:
>> 
>> > Kevin,
>> >
>> > Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > mimeMap = new HashMap<>();
>> > mimeMap.put("xml", "application/xml");
>> > mimeMap.put("csv", "text/csv");
>> > mimeMap.put("json", "application/json");
>> > mimeMap.put("jsonl", "application/json");
>> > mimeMap.put("pdf", "application/pdf");
>> > mimeMap.put("rtf", "text/rtf");
>> > mimeMap.put("html", "text/html");
>> > mimeMap.put("htm", "text/html");
>> > mimeMap.put("doc", "application/msword");
>> > mimeMap.put("docx", 
>> > "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> > mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > mimeMap.put("pptx", 
>> > "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> > mimeMap.put("xls", "application/vnd.ms-excel");
>> > mimeMap.put("xlsx", 
>> > "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("txt", "text/plain");
>> > mimeMap.put("log", "text/plain");
>> >
>> > The keys are the types supported.
>> >
>> >
>> > Amrit Sarkar
>> > Search Engineer
>> > Lucidworks, Inc.
>> > 415-589-9269
>> > www.lucidworks.com
>> > Twitter http://twitter.com/lucidworks
>> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >
>> > On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
>> > wrote:
>> >
>> >> Ah!
>> >>
>> >> Only supported type is: text/html; encoding=utf-8
>> >>
>> >> I am not confident of this either :) but this should work.
>> >>
>> >> See the code-snippet below:
>> >>
>> >> ..
>> >>
>> >> if(res.httpStatus == 200) {
>> >>   // Raw content type of form "text/html; encoding=utf-8"
>> >>   String rawContentType = conn.getContentType();
>> >>   String type = rawContentType.split(";")[0];
>> >>   if(typeSupported(type) || "*".equals(fileTypes)) {
>> >> String encoding = conn.getContentEncoding();
>> >>
>> >> 
>> >>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> Just put "html" too and give it a shot. These are the types it is expecting:

Same thing.

>> 
>> mimeMap = new HashMap<>();
>> mimeMap.put("xml", "application/xml");
>> mimeMap.put("csv", "text/csv");
>> mimeMap.put("json", "application/json");
>> mimeMap.put("jsonl", "application/json");
>> mimeMap.put("pdf", "application/pdf");
>> mimeMap.put("rtf", "text/rtf");
>> mimeMap.put("html", "text/html");
>> mimeMap.put("htm", "text/html");
>> mimeMap.put("doc", "application/msword");
>> mimeMap.put("docx",
>> "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> mimeMap.put("pptx",
>> "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> mimeMap.put("xls", "application/vnd.ms-excel");
>> mimeMap.put("xlsx",
>> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("txt", "text/plain");
>> mimeMap.put("log", "text/plain");
>> 
>> The keys are the types supported.
>> 
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Hi Kevin,
>> 
>> Can you post the solr log in the mail thread? At first glance at the
>> code, I don't think it handled the .md by itself.

How do I extract the log you want?


>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Hi Kevin,
>> 
>> Can you post the solr log in the mail thread? At first glance at the
>> code, I don't think it handled the .md by itself.

Note that when I use the admin web interface and click on "Logging"
on the left, I just see a spinner that implies it's trying to retrieve
the logs (I see the headers "Time (Local)  Level  Core  Logger  Message"),
but no log entries.  It's been like this for 10 minutes.

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
2017-10-13 14:48:38.831 INFO  (qtp1911006827-16) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:48:48.833 INFO  (qtp1911006827-13) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:48:58.833 INFO  (qtp1911006827-13) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:08.834 INFO  (qtp1911006827-15) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:18.832 INFO  (qtp1911006827-17) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:28.835 INFO  (qtp1911006827-11) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:38.861 INFO  (qtp1911006827-14) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=14
2017-10-13 14:49:48.853 INFO  (qtp1911006827-18) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:58.837 INFO  (qtp1911006827-20) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:50:08.833 INFO  (qtp1911006827-16) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/logging 
params={wt=json&_=1507905257696&since=0} status=0 QTime=0



>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> I am not able to replicate the issue on my system, which is a bit
>> annoying for me. Try this out one last time:
>> 
>> docker exec -it --user=solr solr bin/post -c handbook
>> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>> 
>> and set Content-Type to "html" and to "text/html"; try both.

With text/html and your command I get:

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings html
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
    at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
    at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
    at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
    at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
    at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
    at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
    ... 5 more


When I use "-filetypes md" I'm back to the earlier output, where
nothing gets scanned.
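
For context, the stack trace suggests getLinksFromWebPage() builds a DOM
from the fetched page (via makeDom and DocumentBuilder), so any response
that isn't well-formed XML/XHTML fails exactly this way. Here is a tiny
sketch that reproduces the same parser error outside Solr (the class name
and sample string are made up for illustration):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class PrologErrorDemo {
    public static void main(String[] args) throws Exception {
        // Raw markdown (or typical non-XHTML HTML) does not start with an
        // XML prolog or a root element, so the DOM parser rejects it with
        // "Content is not allowed in prolog."
        String page = "# Handbook index\n\nSome markdown text.";
        DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8)));
    }
}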


>> 
>> If you get past this hurdle, let me know.
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log
>> > in
>> > >> the machine. I haven't played much with docker, any way you can get that
>> > >> file from that location.
>> >
>> > I see these files:
>> >
>> > /opt/solr/server/logs/archived
>> > /opt/solr/server/logs/solr_gc.log.0.current
>> > /opt/solr/server/logs/solr.log
>> > /opt/solr/server/solr/handbook/data/tlog
>> >
>> > The 3rd one has very little info.  Attached:
>> >
>> >
>> > 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
>> > jetty-9.3.14.v20161028
>> > 2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
>> > ___  _   Welcome to Apache Solr™ version 7.0.1
>> > 2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter /
>> > __| ___| |_ _   Starting in standalone mode on port 8983
>> > 2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__
>> > \/ _ \ | '_|  Install dir: /opt/solr, Default config dir:
>> > /opt/solr/server/solr/configsets/_default/conf
>> > 2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
>> > |___/\___/_|_|Start time: 2017-10-11T15:28:10.674Z
>> > 2017-10-11 15:28:10.747 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
>> > Using system property solr.solr.home: /opt/solr/server/solr
>> > 2017-10-11 15:28:10.763 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading
>> > container configuration from /opt/solr/server/solr/solr.xml
>> > 2017-10-11 15:28:11.062 INFO  

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> fileType => md is not a recognizable format in SimplePostTool; anyway,
>> moving on.

OK, thanks.  Looks like I'll have to abandon using solr for this
project (or find another way to crawl the site).

Thank you for all the help, though.  I appreciate it.
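
For the archives, here is my reading of the quoted typeSupported() code as
a small runnable sketch (the class name and the trimmed-down map are mine,
not Solr's): the response's Content-Type has to appear as a mimeMap value
whose key is listed in -filetypes, and since mimeMap has no "md" key,
"-filetypes md" can never match, which lines up with the "unsupported
type" / 415 warnings earlier in the thread.

import java.util.HashMap;
import java.util.Map;

public class FileTypesMdDemo {
    // Cut-down copy of the quoted mimeMap, keeping only the entries that
    // matter for an HTML response.
    static final Map<String, String> mimeMap = new HashMap<>();
    static {
        mimeMap.put("html", "text/html");
        mimeMap.put("htm", "text/html");
        // ... the real map has many more entries, but no "md" key ...
    }

    // Same shape as the quoted typeSupported(): the content type must be a
    // mimeMap value AND its key must be listed in -filetypes.
    static boolean typeSupported(String fileTypes, String type) {
        for (String key : mimeMap.keySet()) {
            if (mimeMap.get(key).equals(type) && fileTypes.contains(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(typeSupported("md", "text/html"));   // false -> page skipped
        System.out.println(typeSupported("html", "text/html")); // true
    }
}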

>> The above is a SAXParse runtime exception. Nothing can be done at the
>> Solr end except curating your own data.
>> Some helpful links:
>> https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
>> https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 