My recommendation: start with the default configset (data_driven_schema_configs), like this:

# grab an HTML page to use
curl http://lucene.apache.org/solr/index.html > index.html

bin/solr start
bin/solr create -c html_test
bin/post -c html_test index.html

$ curl "http://localhost:8983/solr/html_test/select?q=*:*&wt=csv"
stream_size,stream_content_type,keywords,x_parsed_by,content_encoding,distribution,title,content_type,viewport,_version_,dc_title,id,resourcename,robots
23049,text/html,"apache\, apache lucene\, apache solr\, solr\, lucene search\, information retrieval\, spell checking\, faceting\, inverted index\, open source","org.apache.tika.parser.DefaultParser,org.apache.tika.parser.html.HtmlParser",UTF-8,Global,Apache Solr -,text/html; charset=UTF-8,"minimal-ui\, initial-scale=1\, maximum-scale=1\, user-scalable=0",1508508085335883776,Apache Solr -,/Users/erikhatcher/dev/trunk/solr/index.html,/Users/erikhatcher/dev/trunk/solr/index.html,"index\,follow"

If you’d like to enhance the extraction for specific xpaths, see
<https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika#UploadingDatawithSolrCellusingApacheTika-InputParameters>.
You can set these parameters on the upload, using -params (see the “Capturing and Mapping” example with -params on the bin/post), or by adjusting the settings of /update/extract in solrconfig.xml.

--
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

> On Aug 3, 2015, at 2:00 PM, Huiying Ma <mahuiying...@gmail.com> wrote:
>
> Thanks Erik,
>
> I'm trying to index some html files in the same format and I need to index
> them according to classes and tags. I've tried data_driven_schema_configs
> but I can only get the title and id but not other tags and classes I
> wanted. So now I want to edit the schema in the basic_configs but turned
> out that error. So do you have any good idea for me? Also, I also tried to
> use bin/post to post an xml file to that same core and it worked so I'm
> wondering why the html file won't work. Thank you so much!!
> Since I don't know much about solr, it's really good that some one can help!
>
> Best,
> Huiying
>
> On Mon, Aug 3, 2015 at 1:54 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>
>> My hunch is that the basic_configs is *too* basic for your needs here.
>> basic_configs does not include /update/extract - it’s very basic - stripped
>> of all the “extra” components.
>>
>> Try using the default, data_driven_schema_configs instead.
>>
>> If you’re still having issues, please provide full details of what you’ve
>> tried.
>>
>> --
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>>
>>> On Aug 3, 2015, at 1:43 PM, Huiying Ma <mahuiying...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> I created a core with the basic config sets and schema, when I use bin/post
>>> to post one html file, I got the error:
>>>
>>> SimplePostTool: WARNING: IOException while reading response:
>>> java.io.FileNotFoundException......
>>> HTTP ERROR 404
>>>
>>> when I go to localhost:8983/solr/core/update, I got:
>>> <response>
>>> <lst name="responseHeader">
>>> <int name="status">400</int>
>>> <int name="QTime">3</int>
>>> </lst>
>>> <lst name="error">
>>> <str name="msg">missing content stream</str>
>>> <int name="code">400</int>
>>> </lst>
>>> </response>
>>>
>>> I'm really new to solr and wondering if anyone know how to index html files
>>> according to my own schema and how to configure the schema.xml or
>>> solrconfig file. Thank you so much!
>>>
>>> Thanks,
>>> Huiying