solr 7.0.1: exception running post to crawl simple website
I want to use solr to index a markdown website. The files are in native markdown, but they are served in HTML (by markserv). Here's what I did:

docker run --name solr -d -p 8983:8983 -t solr
docker exec -it --user=solr solr bin/solr create_core -c handbook

Then, to crawl the site:

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
Exception in thread "main" java.lang.NullPointerException
        at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
quadra[git:master]$

Any ideas on what I did wrong?

Thanks.

Kevin
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> You are getting the NPE at:
>>
>> String type = rawContentType.split(";")[0]; // HERE - rawContentType is NULL
>>
>> // related code
>> String rawContentType = conn.getContentType();
>>
>> public String getContentType() {
>>     return getHeaderField("content-type");
>> }
>>
>> HttpURLConnection conn = (HttpURLConnection) u.openConnection();
>>
>> Can you check at your webpage level that the headers are properly set and
>> include the key "content-type".

Amrit, this is markserv, and I just used wget to prove you are correct: there is no Content-Type header.

Thanks for the help!  I'll see if I can hack markserv to add that, and try again.

Kevin
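For context, a minimal, self-contained sketch of the failure mode described above. The class name and the way the URL is used are only illustrative; the only assumption is the SimplePostTool code quoted in the reply. HttpURLConnection.getContentType() returns null when the response carries no Content-Type header, so splitting it throws the NullPointerException seen in the stack trace.

import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeNpeDemo {
    public static void main(String[] args) throws Exception {
        // Any server that omits the Content-Type header reproduces this.
        URL u = new URL("http://quadra.franz.com:9091/index.md");
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();

        // getContentType() is just getHeaderField("content-type"); it returns
        // null when the header is absent instead of throwing.
        String rawContentType = conn.getContentType();

        // This mirrors the line in SimplePostTool.readPageFromUrl(): with a
        // null rawContentType, split() throws the NullPointerException.
        String type = rawContentType.split(";")[0];
        System.out.println("type = " + type);

        // A defensive variant would check for null first, e.g.:
        // String type = (rawContentType == null) ? "" : rawContentType.split(";")[0];
    }
}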
Re: solr 7.0.1: exception running post to crawl simple website
OK, so I hacked markserv to add Content-Type text/html, but now I get

SimplePostTool: WARNING: Skipping URL with unsupported type text/html

What is it expecting?

$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:03.882
$

Thanks.

Kevin
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Strange,
>>
>> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> Content-Type. Let's see what it says now.

Same thing.  Verified Content-Type:

quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
  Content-Type: text/html;charset=utf-8
quadra[git:master]$

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:00.531
quadra[git:master]$

Kevin
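A side note on why the charset suffix cannot be the difference: the SimplePostTool code quoted earlier splits the raw header on ";" and keeps only the first piece, so "text/html" and "text/html;charset=utf-8" look identical to it. A tiny illustration (class name is illustrative):

public class ContentTypeSplitDemo {
    public static void main(String[] args) {
        // Both header values reduce to "text/html" after the split SimplePostTool
        // performs, so the charset parameter cannot be what trips the
        // "unsupported type" check.
        String withCharset = "text/html;charset=utf-8";
        String withoutCharset = "text/html";

        System.out.println(withCharset.split(";")[0]);    // text/html
        System.out.println(withoutCharset.split(";")[0]); // text/html
    }
}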
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Reference to the code:
>>
>> ...
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if (typeSupported(type) || "*".equals(fileTypes)) {
>>     String encoding = conn.getContentEncoding();
>> ...
>>
>> protected boolean typeSupported(String type) {
>>   for (String key : mimeMap.keySet()) {
>>     if (mimeMap.get(key).equals(type)) {
>>       if (fileTypes.contains(key))
>>         return true;
>>     }
>>   }
>>   return false;
>> }
>>
>> There is another check on fileTypes, and I can see the page you are indexing
>> ends with .md, not .html.  Let's hope this is not the issue.

Did you see the "-filetypes md" at the end of the post command line?  Shouldn't that handle it?

Kevin
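Putting the quoted typeSupported() together with the mimeMap shown in the next message, the likely answer to the question above is no: the map has no "md" key, so with -filetypes md there is no key that both maps to text/html and appears in fileTypes. The page is therefore skipped, and the extract handler reports 415. A small sketch of that check, assuming only the code quoted in this thread (class name and the trimmed-down map are illustrative):

import java.util.HashMap;
import java.util.Map;

public class TypeSupportedSketch {
    // A few entries from the mimeMap quoted later in the thread; note "md" is not a key.
    static final Map<String, String> mimeMap = new HashMap<>();
    static {
        mimeMap.put("html", "text/html");
        mimeMap.put("htm", "text/html");
        mimeMap.put("txt", "text/plain");
    }

    // Mirrors the quoted typeSupported(): a MIME type is accepted only if some
    // key mapping to it is also listed in -filetypes.
    static boolean typeSupported(String fileTypes, String type) {
        for (String key : mimeMap.keySet()) {
            if (mimeMap.get(key).equals(type) && fileTypes.contains(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(typeSupported("md", "text/html"));   // false -> page skipped, 415
        System.out.println(typeSupported("html", "text/html")); // true  -> page accepted
    }
}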
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> Just put "html" too and give it a shot. These are the types it is expecting:

Same thing.

>> mimeMap = new HashMap<>();
>> mimeMap.put("xml", "application/xml");
>> mimeMap.put("csv", "text/csv");
>> mimeMap.put("json", "application/json");
>> mimeMap.put("jsonl", "application/json");
>> mimeMap.put("pdf", "application/pdf");
>> mimeMap.put("rtf", "text/rtf");
>> mimeMap.put("html", "text/html");
>> mimeMap.put("htm", "text/html");
>> mimeMap.put("doc", "application/msword");
>> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> mimeMap.put("xls", "application/vnd.ms-excel");
>> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("txt", "text/plain");
>> mimeMap.put("log", "text/plain");
>>
>> The keys are the types supported.
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Hi Kevin,
>>
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself, at first glance at the code.

How do I extract the log you want?
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Hi Kevin,
>>
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself, at first glance at the code.

Note that when I use the admin web interface and click on "Logging" on the left, I just see a spinner that implies it's trying to retrieve the logs (I see the headers "Time (Local)  Level  Core  Logger  Message"), but no log entries.  It's been like this for 10 minutes.
Re: solr 7.0.1: exception running post to crawl simple website
mp;_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:48:38.831 INFO (qtp1911006827-16) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:48:48.833 INFO (qtp1911006827-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:48:58.833 INFO (qtp1911006827-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:08.834 INFO (qtp1911006827-15) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:18.832 INFO (qtp1911006827-17) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:28.835 INFO (qtp1911006827-11) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:38.861 INFO (qtp1911006827-14) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=14
2017-10-13 14:49:48.853 INFO (qtp1911006827-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:49:58.837 INFO (qtp1911006827-20) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
2017-10-13 14:50:08.833 INFO (qtp1911006827-16) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/logging params={wt=json&_=1507905257696&since=0} status=0 QTime=0
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> I am not able to replicate the issue on my system, which is a bit annoying
>> for me. Try this out one last time:
>>
>> docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>>
>> and have Content-Type: "html" and "text/html", try with both.

With text/html and your command I get

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings html
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
        at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
        ... 5 more

When I use "-filetypes md" I'm back to the regular output that doesn't scan anything.

>> If you get past this hurdle, let me know.
>>
>> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer wrote:
>>
>> > Amrit Sarkar wrote:
>> >
>> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> > >> the machine. I haven't played much with docker, any way you can get that
>> > >> file from that location.
>> >
>> > I see these files:
>> >
>> > /opt/solr/server/logs/archived
>> > /opt/solr/server/logs/solr_gc.log.0.current
>> > /opt/solr/server/logs/solr.log
>> > /opt/solr/server/solr/handbook/data/tlog
>> >
>> > The 3rd one has very little info.  Attached:
>> >
>> > 2017-10-11 15:28:09.564 INFO (main) [ ] o.e.j.s.Server jetty-9.3.14.v20161028
>> > 2017-10-11 15:28:10.668 INFO (main) [ ] o.a.s.s.SolrDispatchFilter  ___      _       Welcome to Apache Solr™ version 7.0.1
>> > 2017-10-11 15:28:10.669 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _   Starting in standalone mode on port 8983
>> > 2017-10-11 15:28:10.670 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /opt/solr, Default config dir: /opt/solr/server/solr/configsets/_default/conf
>> > 2017-10-11 15:28:10.707 INFO (main) [ ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|    Start time: 2017-10-11T15:28:10.674Z
>> > 2017-10-11 15:28:10.747 INFO (main) [ ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /opt/solr/server/solr
>> > 2017-10-11 15:28:10.763 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from /opt/solr/server/solr/solr.xml
>> > 2017-10-11 15:28:11.062 INFO
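For what it's worth, the "Content is not allowed in prolog" failure comes from the link-extraction step visible in the stack trace: SimplePostTool.makeDom() hands the fetched page to javax.xml.parsers.DocumentBuilder, which accepts only well-formed XML. Text or stray bytes ahead of the root element produce exactly this SAXParseException, and HTML that is not valid XHTML fails with related parse errors. A minimal sketch that reproduces the message with a hypothetical page body (the actual markserv output is not shown in the thread):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class PrologErrorDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical page content: anything other than well-formed XML ahead
        // of the root element makes the DOM parser fail the same way the crawler did.
        String page = "hello\n<html><body><a href=\"other.md\">other</a></body></html>";

        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Throws org.xml.sax.SAXParseException:
        //   lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        db.parse(new InputSource(new StringReader(page)));
    }
}

So even when -filetypes html gets the page POSTed, the crawler can only follow links on pages it can parse as XML.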
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> fileType => md is not a recognizable format in SimplePostTool; anyway, moving on.

OK, thanks.  Looks like I'll have to abandon using solr for this project (or find another way to crawl the site).

Thank you for all the help, though.  I appreciate it.

>> The above is a SAXParse runtime exception. Nothing can be done at the Solr end
>> except curating your own data.
>> Some helpful links:
>> https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
>> https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae