Hi,
I wish to index well-formed XML documents as they are.
I have a database filled with MARCXML records. An example of these looks like
this:
[MARCXML record; the element markup did not survive, only these attribute values remain]
http://www.loc.gov/MARC21/slim
http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd
xmlns="http://www.loc.gov/MARC21
> SOLR has of course a problem with the XML in the 'originalRecord' field.
> Is there a solution to this? Has anyone done this before?
I would suggest changing the field type of "originalRecord" to "string"
rather than "text", and if you're still having trouble with the XML data
simply encapsulat
One last one: when you send HTML to Solr, do you also replace special
chars and tags with named entities? I did this and HTMLStripper
doesn't seem to recognise the tags :-S While if I try to input
HTML as is, the indexer throws exceptions (as having tags within XML tags
is obviously not valid).
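
A minimal sketch of one way to get HTML into an update message without breaking the XML, assuming the message is built with a standard XML library rather than by string concatenation. The field names `id` and `body` are hypothetical, and Python's ElementTree stands in for whatever client library is actually in use. The library escapes the markup once on output, and any XML parser reading the message gets the original tags back:

```python
import xml.etree.ElementTree as ET

def solr_add_xml(doc_id, html_body):
    """Build an <add> message; the field names 'id' and 'body' are assumptions."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("body", html_body)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes <, > and & when serializing
    return ET.tostring(add, encoding="unicode")

message = solr_add_xml("1", "<p>Paris &amp; London</p>")
print(message)

# Parsing the message again yields the original HTML string, tags included.
roundtrip = ET.fromstring(message).find("doc/field[@name='body']").text
print(roundtrip)
```

Escaping by hand and then passing the result through a library that escapes again would double-encode the tags, which is worth ruling out when a downstream stripper does not seem to see them.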
Thanks Adrian, I'm very new to Solr myself so struggling a bit in
initial stages...
One last one: when you send HTML to Solr, do you also replace special
chars and tags with named entities? I did this and HTMLStripper
doesn't seem to recognise the tags :-S While if I try to input
HTML as i
On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
(Query esp. Adrian):
If you are indexing XHTML, do you replace tags with entities before
giving it to solr, if so, when you get back snippets do you get tags
or entities or do you convert again to tags for presentation? What's
the best way out?
Hello Benoit,
An additional thing to check out is the work being done on fac-back-opac.
They have a parser that will parse native MARC records.
I would assume that if you can extract your records in MARC XML you can
extract them in native MARC.
I've used the parser and it works well.
al
On Fri
Benoit,
Are you familiar with the Vufind project (http://www.vufind.org)? You
can look at the PHP code in the import folder to see how the indexing
works (there's an XSL transformation that then updates the index).
I've also written some initial code to use embedded Solr to do this
indexing di
Solr is not an XML engine (or a MARC engine). It uses XML as an input format
for fielded data. It does not index or search arbitrary XML. You need to
convert your XML into Solr's format.
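
As a rough illustration of that conversion step, here is a sketch in Python that flattens one MARCXML record into a Solr `<doc>`. The mapping from MARC tags and subfield codes to Solr field names (`title`, `author`) is hypothetical, as is the sample record; a real schema would need far more fields and a decision about repeated subfields.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical mapping from (MARC tag, subfield code) to a Solr field name.
FIELD_MAP = {("245", "a"): "title", ("100", "a"): "author"}

def marc_to_solr_doc(record_xml, doc_id):
    """Turn one MARCXML <record> into a Solr <doc> element."""
    record = ET.fromstring(record_xml)
    doc = ET.Element("doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for datafield in record.findall("marc:datafield", MARC_NS):
        for subfield in datafield.findall("marc:subfield", MARC_NS):
            key = (datafield.get("tag"), subfield.get("code"))
            if key in FIELD_MAP and subfield.text:
                field = ET.SubElement(doc, "field", name=FIELD_MAP[key])
                field.text = subfield.text.strip()
    return doc

# Made-up record for illustration only.
sample = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Doe, Jane</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">An example title</subfield></datafield>
</record>"""

add = ET.Element("add")
add.append(marc_to_solr_doc(sample, "rec-1"))
print(ET.tostring(add, encoding="unicode"))
```

The same mapping could just as well be written as an XSL transformation; the point is only that Solr receives named fields, not the original record structure.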
I would recommend expressing MARC in a Solr schema, then working on the
input XML. The input XML depends on the
That is one seriously manly regex, but I'd recommend using the Tag Soup
parser instead:
http://ccil.org/~cowan/XML/tagsoup/
wunder
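
To illustrate the parser-based approach (as opposed to a regex), here is a minimal sketch using Python's standard `html.parser` as a stand-in; it is not TagSoup, and the class and function names are made up for the example. It collects only the text nodes and tolerates unclosed tags:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    # Join the chunks and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())

print(html_to_text("<p>Visit <b>Paris</b> &amp; London<br>soon"))
```

Joining chunks with a space is a simplification: a tag in the middle of a word would split that word in two.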
On 10/4/07 10:11 PM, "J.J. Larrea" <[EMAIL PROTECTED]> wrote:
> It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or
> XML-like tags:
>
Adrian Sutton wrote:
> We didn't do anything at all to the HTML, the editor returns valid XHTML
> (using numeric entities, never named entities which aren't valid in XML
> and don't tend to work in XHTML) [...]
Named entity references are valid in XML. They just need to be declared
before they ar
At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
>From what people are suggesting though you'd be better off converting to plain
>text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net)
>can parse most HTML that's around and you can iterate over the DOM to extract
>the text f
Is there any way to merge fields at indexing time?
I have field1 and field2 and would like to combine these fields and make
field3.
In the document, there are field1 and field2, and I may build field3 using
CopyField.
Thanks,
Jae
Jae,
The easiest way to do this is with CopyField.
These entries in your schema will accomplish that:
Field 3 will have the tokens from both field 1 and 2 in it.
If you want to merge those 2 fields for display, I would just concat
them at display time.
Dave
-Original Message
: Although I haven't tried yet, I can't imagine that this request returns in
: sub-zero seconds, which is what I want (having a index of about 1M docs with
: 6000 fields/ doc and about 10 complex facetqueries / request).
i wouldn't necessarily assume that :)
If you have a request handler whi
I'm having a problem with sorting on a certain field. In my schema.xml
it's defined as a string (not analyzed, indexed/stored verbatim). But
when I look at my results (sorted on that field ascending) I get
things like the following:
Yr City's A Sucker
Movement b/w Yr City's A Sucker
X, Y & Sometim
can you post...
* the fieldtype declaration from your schema.xml
* the field declaration from your schema
* the full URL that generated that ordering
* the full XML output from that URL
(you can set the "fl" param to just be the field you are sorting on and
score if the XML response is real
Thanks all for very valuable contributions, I understand these aspects
of Solr much better now
but...
>But a different use-case might be for the highlighting to encompass
>the markup rather than just the text, e.g.
> Paris
>which would have to be accomplished some other way.
Yes, exactly. And
Hi,
I'm trying to find a way to express a certain query and wondering if
anyone could help.
The query is against a schema that stores the user_ids who have worked
on each document in a multi-value integer field called 'user_ids'. I'd
like to query solr for all documents that anyone other th
On 5-Oct-07, at 11:59 AM, Ravish Bhagdev wrote:
But a different use-case might be for the highlighting to encompass
the markup rather than just the text, e.g.
Parisspan>
which would have to be accomplished some other way.
Yes, exactly. And I think nutch handles this somehow as I remember
A gotcha here is that copyField creates multiple values. Each field copied
in becomes a separate value. If you wanted a single-valued field this will
not work.
Lance Norskog
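
If a single-valued merged field is really needed, one alternative is to build it in the indexing client before the document is sent. This is a sketch of that idea only, not something Solr does for you; the helper name and the dict-based document are assumptions, and the field names follow the ones used in this thread.

```python
def merge_fields(doc, sources=("field1", "field2"), target="field3", sep=" "):
    """Return a copy of doc with one single-valued target field."""
    merged = dict(doc)
    # Skip missing or empty source fields so no stray separators appear.
    parts = [str(doc[name]) for name in sources if doc.get(name)]
    merged[target] = sep.join(parts)
    return merged

print(merge_fields({"id": "1", "field1": "foo", "field2": "bar"}))
```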
-Original Message-
From: Keene, David [mailto:[EMAIL PROTECTED]
Sent: Friday, October 05, 2007 10:50 AM
To: solr-user@luce
Sorry, user error. In the example I posted the field type was actually
not string. But I was getting confused on another field because I
didn't realize that string was case sensitive. Too many fields to
think about! :)
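
For anyone else who trips over this, the effect is easy to reproduce outside Solr. The sketch below uses plain Python sorting as an analogy for sorting on an unanalyzed string field; the lowercase entry is made up for the example. Uppercase letters sort before lowercase ones unless the values are normalized first:

```python
titles = ["Yr City's A Sucker", "abacus", "Movement b/w Yr City's A Sucker"]

# Case-sensitive ordering: every uppercase letter sorts before any lowercase one.
print(sorted(titles))

# Case-insensitive ordering: compare on a lowercased key instead.
print(sorted(titles, key=str.lower))
```

In Solr terms the second ordering would need a separate sort field whose values are lowercased at index time.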
On 10/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> can you post...
>
>
Howdy all,
We are attempting to provide access to about 8 million records of
highly variable quality and length. In a nutshell, we are trying to
find a way to deprioritize "suspect" records without discriminating
against useful records that happen to be short. We do not wish to
eliminate suspect r
On 5-Oct-07, at 2:06 PM, Kyle Banerjee wrote:
Howdy all,
We are attempting to provide access to about 8 million records of
highly variable quality and length. In a nutshell, we are trying to
find a way to deprioritize "suspect" records without discriminating
against useful records that happen t
> If you know at index time that the document is shady, the easiest way
> to de-emphasize it globally is to set the document boost to some
> value other than one.
>
> ...
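
A sketch of what setting a document boost could look like when the update message is generated in code, assuming the `boost` attribute on `<doc>` in Solr's XML update format. The field names and the 0.5 value are illustrative only.

```python
import xml.etree.ElementTree as ET

def add_message(fields, boost=1.0):
    """Build an <add> message, marking the doc with a boost if it is not 1.0."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost != 1.0:
        doc.set("boost", str(boost))  # values below 1.0 de-emphasize the doc
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")

print(add_message({"id": "rec-42", "title": "Untitled"}, boost=0.5))
```

Because the boost is applied at index time, changing it later means reindexing the affected documents.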
I considered that, but assumed we'd get the values wrong at first and
have to do a lot of tinkering before we got it right. Is
On 5-Oct-07, at 3:01 PM, Kyle Banerjee wrote:
If you know at index time that the document is shady, the easiest way
to de-emphasize it globally is to set the document boost to some
value other than one.
...
I considered that, but assumed we'd get the values wrong at first and
have to do a lot
On 10/5/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> The other option is to use a function query on the value stored in a
> field (which could represent a range of 'badness'). This can be used
> directly in the dismax handler using the bf (boost function) query
> parameter.
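
A minimal sketch of building such a request, assuming a numeric field named `badness` (hypothetical), a handler selected with `qt=dismax`, and a local Solr URL; none of these come from the thread. `recip(x,m,a,b)` computes a/(m*x+b), so with these constants the boost is 1 for a badness of 0 and shrinks as badness grows.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def dismax_url(base, user_query, boost_function):
    """Build a select URL with a dismax boost function (bf) parameter."""
    params = {"qt": "dismax", "q": user_query, "bf": boost_function}
    return base + "?" + urlencode(params)

# Base URL, field name and constants are assumptions for illustration.
url = dismax_url("http://localhost:8983/solr/select",
                 "oregon history",
                 "recip(badness,1,1,1)")
print(url)

# Decoding the query string shows the parameters survive URL encoding.
print(parse_qs(urlparse(url).query))
```

Unlike an index-time document boost, the function and its constants can be tuned per request without reindexing.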
In the near future, you
Named entity references are valid in XML. They just need to be declared
before they are used[1], unless they are one of the builtin named
entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
parsing with an XML parser.
Correct, it was an offhand comment and I skipped over all the details
Dave,
Have you tried using &debugQuery=true ? :)
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: "Keene, David" <[EMAIL PROTECTED]>
To: Teruhiko Kurosaka <[EMAIL PROTECTED]>
Cc: so