Thank you so much for your help... I will try it...
Ah! You need a SolrJ program that uses Tika to parse the files and
upload the text. I think there is such a program already but do not
know where it is.
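Failing that, it is only a few lines with SolrJ's ContentStreamUpdateRequest against the /update/extract handler. A rough sketch with 1.4-era SolrJ; the URL, file path, and id value are placeholders:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class HtmlUpload {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // /update/extract is the ExtractingRequestHandler; Tika parses the HTML server-side
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("docs/page1.html"));
        req.setParam("literal.id", "page1");   // supply the unique key yourself
        req.setParam("commit", "true");        // commit once the upload is done
        server.request(req);
    }
}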
Lance
On Thu, Jun 17, 2010 at 6:13 AM, seesiddharth wrote:
Thank you so much for the reply... The link suggested by you is helpful,
but they explain everything using the curl command, which I don't want to
use. I was more interested in uploading the .html documents via an HTTP
web request. So I have stored all the .html files in one location & then
create
This is the tool in Solr for indexing various kinds of content. After
you learn the basics of indexing (see solr/example/exampledocs for
samples), the ExtractingRequestHandler will make sense:
http://wiki.apache.org/solr/ExtractingRequestHandler
On Tue, Jun 15, 2010 at 12:35 AM, seesiddharth wrote:
Looking at it again, there appears to be only one HTML stripper. Your
alternative is to use the regex PatternReplace stuff with some custom
patterns, or to make a stopword list of all HTML keywords.
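The pattern itself can stay crude. A sketch in plain Java of the kind of rule you would hand to PatternReplaceFilterFactory; it will mangle scripts, comments, and attributes that contain '>':

// Replace anything tag-like with a space, then collapse whitespace.
String stripped = html.replaceAll("(?s)<[^>]*>", " ")
                      .replaceAll("\\s+", " ")
                      .trim();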
On Thu, Jun 10, 2010 at 8:00 AM, Blargy wrote:
Do I even need to tidy/clean up the html if I use the
HTMLStripCharFilterFactory?
On Jun 9, 2010, at 8:38pm, Blargy wrote:
What is the preferred way to index html using DIH (my html is stored in a
blob field in our database)?
I know there is the built in HTMLStripTransformer but that doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at
index time?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
Does the HTMLStripChar apply at index time or query time? Would it matter
to use one over the other?
As a side question, if I want to perform highlighter summaries against this
field, do I need to store the whole field or just index it with
TermVector.WITH_POSITIONS_OFFSETS?
The HTMLStripChar variants are newer and might work better.
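If you want to preview what the CharFilter will do before wiring HTMLStripCharFilterFactory into your schema, you can drive it directly from Java. A sketch assuming the Solr 1.4-era class locations:

import java.io.BufferedReader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.solr.analysis.HTMLStripCharFilter;

public class StripPreview {
    public static void main(String[] args) throws Exception {
        String html = "<p>Hello <b>world</b></p>";
        // Wrap the raw markup; the filter yields the text with tags removed
        BufferedReader in = new BufferedReader(
            new HTMLStripCharFilter(CharReader.get(new StringReader(html))));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
    }
}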
Thank you! That's even more than I wanted to know. ;)
Georg
On Tue, Mar 2, 2010 at 10:05 PM, Walter Underwood wrote:
You are in luck, because Avi Rappoport has just written a tutorial about how to
do this. It is available from Lucid Imagination:
http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr
I've just started reviewing it, but knowing Avi, I expect it to be very helpful.
There is an HTML filter documented here, which might be of some help -
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
Control characters can be eliminated using code like this -
http://bitbucket.org/cogtree/python-solr/src/tip/pythonsolr/pysolr.py#cl-44
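Roughly the same idea in Java, for reference (a sketch): drop everything below 0x20 that XML 1.0 forbids, keeping tab, newline, and carriage return.

// Remove control characters that are illegal in XML 1.0 before posting.
String cleaned = raw.replaceAll("[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F]", "");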
amount of
string processing it does, the fact that it is a Reader probably does not
affect its performance.
Cheers,
Lance
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML
John,
Solr already has some of this stuff:
$ ff \*HTML\*java
./src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
./src/java/org/apache/solr/analysis/HTMLStripStandardTokenizerFactory.java
./src/java/org/apache/solr/analysis/HTMLStripReader.java
./src/java/org/apache/solr/analysis/HTMLStripWhitespaceTokenizerFactory.java
Actually, it's very easy: http://us2.php.net/strip_tags
I also store the data in a separate field with the html intact for
display. In that case, I use urlencode on the string.
David
Hi,
Maybe this one?
http://htmlparser.sourceforge.net/
/Jimi
Quoting "McBride, John" <[EMAIL PROTECTED]>:
Hello,
In my application I wish to index articles which are stored in HTML
format.
Upon indexing these the html gets stored along with the content of the
article, which is undesirable.
On 3-Oct-07, at 3:26 AM, Ravish Bhagdev wrote:
Because of this I cannot present the resulting html in a webpage. Is
it possible to strip out all HTML tags completely in the result set?
Would you recommend sending stripped-out text to Solr instead? But
doesn't Solr use HTML features while searching
Hi Erik, All,
I escaped the HTML text into entities before sending it to Solr, and
indexing went fine. The problem now is that when I get back a snippet
with highlighted text for the html field, it's not well formed, as the
highlighting sometimes doesn't include the entire tag, if one is
present. For example:
On Aug 27, 2007, at 10:00 AM, Michael Kimsal wrote:
What's odd about this is that the error seems to indicate that I did.
Actually, the error message looks like you escaped too much. You
should _not_ escape the <field> tag itself, only the contents of it.
Erik
What's odd about this is that the error seems to indicate that I did.
The full text (minus the stack trace) was
org.xmlpull.v1.XmlPullParserException: parser must be on START_TAG or TEXT
to read text (position: START_TAG seen ...... @4:37)
Or is that just a by
Michael,
I think the issue is that you're not escaping the values.
Send something like this to Solr instead:
&lt;a&gt;linktext&lt;/a&gt;
Erik
On Aug 27, 2007, at 9:29 AM, Michael Kimsal wrote:
Hello
I'm trying to index individual lines of an HTML file, a
I think you can use the HTMLStripWhitespaceTokenizerFactory.
Look here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e
I hope this helps.
Thanks Jérôme!
It seems to work now. I just hope the provided
HTMLStripWhitespaceTokenizerFactory will strip the right tags now.
I use Java and used the HtmlEncoder provided in
http://itext.ugent.be/library/api/ for encoding, with success. (Just in
case someone happens to search this thread.)
Ravi
You need to encode your HTML content so it can be included as a normal
'string' value in your XML element.
As far as I remember, the only unsafe characters you have to encode as
entities are:
< -> &lt;
> -> &gt;
" -> &quot;
& -> &amp;
(Google "xml entities" to be sure.)
I don't know what language you use, but fo
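In Java, for instance, a hand-rolled version might look like this (a sketch; any XML library's escaping utility does the same job):

// Order matters: '&' must be escaped first, or the other
// entities would themselves get double-escaped.
static String escapeXml(String s) {
    return s.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;");
}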
Peter,
I was playing with Nutch for quite some time before Solr, so
I know Nutch better than Solr. Nutch has a plugin mechanism
so that you can add a parser for a document type. It comes with
parser plugins for most popular doc types (with varying degrees of
international text support).
My que
I guess I misread your original question. I believe Nutch would be the
choice for crawling, however I do not know about its abilities for indexing
other document types. If you needed to index multiple document types such
as PDF, DOC, etc and Nutch does not provide functionality to do so you woul
Thank you, Otis and Peter, for your replies.
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> doc of some type -> parse content into various fields -> post to Solr
I understand this part, but the question is who should do this.
I was under the assumption that it's the Solr client's job to crawl the
A coworker of mine posted the code that we used for adding pdf, doc, xls,
etc. documents into Solr. You can find the files at the following location.
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
Just apply the patch and put the
Kuro,
doc of some type -> parse content into various fields -> post to Solr
Even Nutch does the same - there is a title field, a content field, and so on
(the exact names may be different).
Of course, you can always just combine everything into a single content field.
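In SolrJ terms, the last two steps are just this (a sketch with a later-era client; the field names and parsed values are placeholders for whatever your parser produces):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PostParsed {
    public static void main(String[] args) throws Exception {
        // "parse content into various fields" happens in whatever parser
        // suits the doc type; these values stand in for its output.
        String parsedTitle = "Example title";
        String parsedBody = "Example body text";

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");        // placeholder field names
        doc.addField("title", parsedTitle);
        doc.addField("content", parsedBody);

        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.add(doc);
        server.commit();
    }
}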
Otis