Re: posting html files

Erik Hatcher Mon, 03 Aug 2015 11:18:43 -0700

My recommendation: start with the default configset 
(data_driven_schema_configs), like this:


  # grab an HTML page to use
  curl http://lucene.apache.org/solr/index.html > index.html

  bin/solr start
  bin/solr create -c html_test
  bin/post -c html_test index.html

$ curl "http://localhost:8983/solr/html_test/select?q=*:*&wt=csv"
stream_size,stream_content_type,keywords,x_parsed_by,content_encoding,distribution,title,content_type,viewport,_version_,dc_title,id,resourcename,robots
23049,text/html,"apache\, apache lucene\, apache solr\, solr\, lucene search\, information retrieval\, spell checking\, faceting\, inverted index\, open source","org.apache.tika.parser.DefaultParser,org.apache.tika.parser.html.HtmlParser",UTF-8,Global,Apache Solr -,text/html; charset=UTF-8,"minimal-ui\, initial-scale=1\, maximum-scale=1\, user-scalable=0",1508508085335883776,Apache Solr -,/Users/erikhatcher/dev/trunk/solr/index.html,/Users/erikhatcher/dev/trunk/solr/index.html,"index\,follow"

If you’d like to enhance the extraction for specific XPaths, see the input 
parameters documented at 
<https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika#UploadingDatawithSolrCellusingApacheTika-InputParameters>.
You can set these parameters on the upload using -params (see the 
“Capturing and Mapping” example, which passes -params to bin/post), or by 
adjusting the settings of /update/extract in solrconfig.xml.
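
As a rough sketch of the -params route: the snippet below joins a few Solr 
Cell input parameters into a single query string and hands it to bin/post. 
The parameter names (capture, captureAttr, fmap.*) come from the Solr Cell 
documentation linked above, but the values, the div_t field name, and the 
join_params helper are illustrative assumptions, not something from your 
setup. Adjust them to the tags you actually care about.

```shell
#!/bin/sh
# join_params is a hypothetical helper: it joins key=value pairs with "&"
# to form the string passed to bin/post via -params.
join_params() {
  out=""
  for kv in "$@"; do
    if [ -z "$out" ]; then
      out="$kv"
    else
      out="$out&$kv"
    fi
  done
  printf '%s\n' "$out"
}

# Illustrative values only: capture <div> elements (and their attributes)
# and map the captured text into an assumed field named div_t.
PARAMS=$(join_params "capture=div" "captureAttr=true" "fmap.div=div_t")
echo "$PARAMS"

# Only attempt the upload when run from a Solr install directory.
if [ -x bin/post ]; then
  bin/post -c html_test -params "$PARAMS" index.html
fi
```

Re-running the select query afterwards should show whether the extra field 
was populated. If it was not, the capture and fmap values are the first 
things to revisit.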


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




> On Aug 3, 2015, at 2:00 PM, Huiying Ma <mahuiying...@gmail.com> wrote:
> 
> Thanks Erik,
> 
> I'm trying to index some HTML files that share the same format, and I need
> to index them according to their classes and tags. I've tried
> data_driven_schema_configs, but I only get the title and id, not the other
> tags and classes I want. So now I'm trying to edit the schema in
> basic_configs, but that led to the error I described. Do you have any
> suggestions? I also used bin/post to post an XML file to that same core and
> it worked, so I'm wondering why the HTML file won't. Thank you so much!!
> Since I don't know much about Solr, it's really good that someone can help!
> 
> Best,
> Huiying
> 
> On Mon, Aug 3, 2015 at 1:54 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> 
>> My hunch is that the basic_configs is *too* basic for your needs here.
>> basic_configs does not include /update/extract - it’s very basic - stripped
>> of all the “extra” components.
>> 
>> Try using the default, data_driven_schema_configs instead.
>> 
>> If you’re still having issues, please provide full details of what you’ve
>> tried.
>> 
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>> 
>> 
>> 
>> 
>>> On Aug 3, 2015, at 1:43 PM, Huiying Ma <mahuiying...@gmail.com> wrote:
>>> 
>>> Hi everyone,
>>> 
>>> I created a core with the basic config sets and schema, when I use
>>> bin/post to post one html file, I got the error:
>>> 
>>> SimplePostTool: WARNING: IOException while reading response:
>>> java.io.FileNotFoundException......
>>> HTTP ERROR 404
>>> 
>>> when I go to localhost:8983/solr/core/update, I got:
>>> <response>
>>> <lst name="responseHeader">
>>> <int name="status">400</int>
>>> <int name="QTime">3</int>
>>> </lst>
>>> <lst name="error">
>>> <str name="msg">missing content stream</str>
>>> <int name="code">400</int>
>>> </lst>
>>> </response>
>>> 
>>> I'm really new to Solr and wondering if anyone knows how to index HTML
>>> files according to my own schema and how to configure the schema.xml or
>>> solrconfig file. Thank you so much!
>>> 
>>> Thanks,
>>> Huiying
>> 
>>
