Kevin,

You are getting an NPE at:

    String type = rawContentType.split(";")[0]; // HERE - rawContentType is NULL

    // related code
    String rawContentType = conn.getContentType();

    public String getContentType() {
        return getHeaderField("content-type");
    }

    HttpURLConnection conn = (HttpURLConnection) u.openConnection();

Can you check that your web server sets its response headers properly and that they include the key "content-type"?

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Wed, Oct 11, 2017 at 9:08 PM, Kevin Layer <la...@franz.com> wrote:

> I want to use solr to index a markdown website.  The files
> are in native markdown, but they are served in HTML (by markserv).
>
> Here's what I did:
>
> docker run --name solr -d -p 8983:8983 -t solr
> docker exec -it --user=solr solr bin/solr create_core -c handbook
>
> Then, to crawl the site:
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> Exception in thread "main" java.lang.NullPointerException
>     at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
>     at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
>     at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>     at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>     at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>     at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> quadra[git:master]$
>
> Any ideas on what I did wrong?
>
> Thanks.
>
> Kevin
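
P.S. To confirm the diagnosis at the top of this reply, a small standalone check along the following lines might help. It is only a sketch and not the SimplePostTool code: the class name ContentTypeCheck, the method names, and the fallback type are my own assumptions. It relies on the fact that URLConnection.getContentType() returns null when the server sends no content-type header, and guards against that before parsing.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for illustration; not part of SimplePostTool.
final class ContentTypeCheck {

    // Assumed fallback for responses that carry no content-type header.
    static final String FALLBACK_TYPE = "application/octet-stream";

    // Returns the media type without parameters ("text/html" for
    // "text/html; charset=UTF-8"), or FALLBACK_TYPE when the header
    // value is null or blank.
    static String baseType(String rawContentType) {
        if (rawContentType == null) {
            return FALLBACK_TYPE;
        }
        int semi = rawContentType.indexOf(';');
        String base = (semi >= 0 ? rawContentType.substring(0, semi) : rawContentType).trim();
        return base.isEmpty() ? FALLBACK_TYPE : base;
    }

    // Opens the URL and reports its base content type. getContentType()
    // returns null when the server does not send the header.
    static String fetchBaseType(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            return baseType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(baseType("text/html; charset=UTF-8"));
        System.out.println(baseType(null));
        if (args.length > 0) {
            // Pass the page URL as the first argument to test a live server.
            System.out.println(fetchBaseType(args[0]));
        }
    }
}
```

If you run it with your page URL as the only argument and the last line printed is the fallback type, the server is not sending a content-type header for that page.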