I am trying to narrow down what gets indexed, but even basic XPath
queries don't work. For example, with tutorial.pdf, this indexes
all the <div/> elements:
curl http://localhost:8983/solr/update/extract?ext.idx.attr=true
\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div
\&ext.literal.id=126\&ext.xpath=\/xhtml:html\/xhtml:body\/
descendant:node\(\) -F "tutori...@tutorial.pdf"
However, to index only the first div, I would expect this to work:
budapest:site epugh$ curl http://localhost:8983/solr/update/extract?ext.idx.attr=true
\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div
\&ext.literal.id=126\&ext.xpath=\/xhtml:html\/xhtml:body\/
xhtml:div[1] -F "tutori...@tutorial.pdf"
But curl keeps returning the error below, and my attempts to escape
the [1] have failed. Any suggestions?
curl: (3) [globbing] error: bad range specification after pos 174
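The message comes from curl's URL globbing, which reads [ ] in a URL as a range, so one workaround that may be worth trying is to percent-encode the brackets before curl sees them. A minimal, untested sketch, assuming the extract handler decodes %5B and %5D back to [ and ] (the variable names are just for illustration):

```shell
# Percent-encode the square brackets so curl's URL globbing does not
# try to parse [1] as a range.
# Assumption: Solr decodes %5B / %5D back to [ ] before evaluating the XPath.
xpath='/xhtml:html/xhtml:body/xhtml:div[1]'
encoded=$(printf '%s' "$xpath" | sed -e 's/\[/%5B/g' -e 's/]/%5D/g')

# Quote the whole URL so the shell leaves & alone and no backslashes are needed.
url="http://localhost:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text&ext.map.div=foo_t&ext.capture=div&ext.literal.id=126&ext.xpath=${encoded}"
echo "$url"

# The request itself would then be (same -F argument as above):
#   curl "$url" -F "tutori...@tutorial.pdf"
# Alternatively, curl's -g (--globoff) option switches globbing off, so the
# unencoded [1] can stay in the quoted URL:
#   curl -g "<same URL ending in xhtml:div[1]>" -F "tutori...@tutorial.pdf"
```

Either form only gets the brackets past curl; whether the XPath then selects just the first div still depends on what the handler's XPath support accepts.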
Eric
PS: this site seems to be a decent place to upload your HTML and
practice XPath:
http://www.whitebeam.org/library/guide/TechNotes/xpathtestbed.rhtm
I did have to strip out the namespace stuff though.
-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal