[
https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013974#comment-14013974
]
Nick Burch commented on TIKA-1315:
----------------------------------
In ListUtils, it might be safest to make UNORDERED_LIST_CHAR use the Unicode
escape sequence for that character, and maybe initialise the map up front.
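A minimal sketch of what those two suggestions could look like. This is not the attached ListUtils.java: the class name ListUtilsSketch, the choice of U+2022 as the bullet, and the map's name, key/value types, and entries are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; names and contents are not taken from the patch.
final class ListUtilsSketch {

    // Assumed bullet character (U+2022 BULLET), written as a Unicode escape so
    // the value does not depend on the encoding used to read the source file.
    static final char UNORDERED_LIST_CHAR = '\u2022';

    // Lookup table filled once when the class is loaded and exposed read-only,
    // instead of being populated lazily on first use.
    // The entries are placeholders, not the real contents of the map.
    static final Map<Integer, String> NUMBER_FORMATS;
    static {
        Map<Integer, String> formats = new HashMap<Integer, String>();
        formats.put(0, "decimal");
        formats.put(1, "upperRoman");
        NUMBER_FORMATS = Collections.unmodifiableMap(formats);
    }

    private ListUtilsSketch() {
        // Static utility holder; not meant to be instantiated.
    }
}
```

With the escape, the constant stays the same whatever charset the compiler assumes, and the static initialiser means no caller ever sees a partially filled map.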
ListUtils seems to largely be based on someone else's work - do you have their
OK to contribute it to Tika? (We can only accept code that is either already
under an appropriate license, or willingly contributed by the author)
The unit test looks a little slim - any chance of something that checks in a
bit more detail? E.g. explicit entry checks, HTML-level checks, etc.?
Can we do the same for XWPF / .docx files using the same / similar logic?
> Basic list support in WordExtractor
> -----------------------------------
>
> Key: TIKA-1315
> URL: https://issues.apache.org/jira/browse/TIKA-1315
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Filip Bednárik
> Priority: Minor
> Fix For: 1.6
>
> Attachments: ListUtils.java, WordExtractor.java.patch,
> WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post an issue like this, but I have no other
> way of contacting you and I don't quite understand how you manage forks and
> pull requests (I don't think you do that). Plus, I don't know your coding
> style.
> In my project I needed Tika to parse numbered lists from Word .doc
> documents, but Tika doesn't support that. So I looked for a solution and found
> one here:
> http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/
> . I adapted this solution to Apache Tika with a few fixes and improvements.
> Anyway, feel free to use any of it so it can help people who struggle with
> lists in Tika like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils