Is it possible you have two solr instances running off the same index
folder? This was a mistake I stumbled into early on - I was writing
with one, and reading with the other, so I didn't see updates.
-Mike
On 09/15/2011 12:37 AM, Pawan Darira wrote:
I am committing but not doing replication.
We have the identical problem in our system.
Our plan is to encode the most recent version of a document using an
explicit field/value, i.e.
version=current
(or maybe current=true)
We also need to allow users to search for the most current,
but only within versions they have access to.
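A minimal sketch of how that combined query might look (field names are
illustrative, not from this thread):
+version:current +acl:(groupA OR groupB) +(the user's query terms)
where acl is a multi-valued field naming the groups allowed to see each
version.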
Is there some reason you don't want to leverage Highlighter to do this
work? It has all the necessary code for using the analyzed version of
your query, so it will only match tokens that really contribute to the
search match.
You might also be interested in LUCENE-2878 (which is still under
d
On 10/24/2011 02:35 PM, MBD wrote:
Is this really a stumper? This is my first experience with Solr and having spent
only an hour or so with it I hit this barrier (below). I'm sure *I* am doing
something completely wrong; just hoping someone more familiar with the platform
can help me identify
Since you are performing a complete reload of all of your data, I don't
understand why you can't create a new core, load your new data, swap
your application to look at the new core, and then erase the old one, if
you want.
Even so, you could track the timestamps on all your documents, which
I'm experiencing something really weird: I get different results
depending on whether I specify wt=javabin, and retrieve using SolrJ, or
wt=xml. I spent quite a while staring at query params to make sure
everything else is the same, and they do seem to be. At first I thought
the problem relat
quick follow-up: I also notice that the query from solrj gets version=1,
whereas the admin webapp puts version=2.2 on the query string, although
this param doesn't seem to change the xml results at all. Does this
indicate an older version of solrj perhaps?
-Mike
On 10/21/2010 04:47 PM, Mike Sokolov wrote:
I'm experiencing something really weird: I get different results depending
on whether I specify wt=javabin, and retrieve using SolrJ, or wt=xml. I
spent quite a while staring at query params to make sure everything else is
the sam
I am looking into the virtual hosts config in tomcat; it seems as if
there must indeed be another solr instance running; in fact I'm now
concerned there might be two solr instances running against the same
data folder. yargh.
-Mike
On 10/22/2010 09:05 AM, Mike Sokolov wrote:
Yes - I r
Right - my point was to combine this with the previous approaches to
form a query like:
samsung AND android AND GPS AND word_count:3
in order to exclude documents containing additional words. This would
avoid the combinatoric explosion problem others had alluded to earlier.
Of course this wou
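(A sketch of the indexing side for that word_count field, assuming the
count is computed client-side with SolrJ; the URL and field names are
illustrative:)

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class WordCountIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        String text = "samsung android GPS";
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", text);
        // word_count is what lets the query exclude longer documents
        doc.addField("word_count", text.trim().split("\\s+").length); // 3
        server.add(doc);
        server.commit();
    }
}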
seems like the only working idea. Maybe Varun
could comment on the maximum numbers of terms that his queries will
contain?
Regards,
Toke Eskildsen
On Wed, 2010-10-27 at 15:02 +0200, Mike Sokolov wrote:
Right - my point was to combine this with the previous approaches to
form a query like
Another alternative (prettier to my eye), would be:
(city:Chicago AND Romantic AND View)^10 OR (Romantic AND View)
-Mike
On 11/03/2010 09:28 AM, kenf_nc wrote:
Unfortunately the default operator is set to AND and I can't change that at
this time.
If I do (city:Chicago^10 OR Romantic OR Vi
If your ranges are always contiguous, you could index two fields:
range-start and range-end and then perform queries like:
range-start:[* TO 30] AND range-end:[5 TO *]
If you have multiple ranges which could have gaps in between, then you
need something more complicated :)
On 02/27/2012 04:09
I think your example case would end up like this:
<doc>
  ...
  <field name="range-start">1</field>  <!-- single-valued range field -->
  <field name="range-end">15</field>
  ...
</doc>
On 02/27/2012 04:26 PM, federico.wachs wrote:
Michael thanks a lot for your quick answer, but I'm not exactly sure I
understand your solution.
How would the document you are proposin
No; contiguous means there are no gaps between them.
You need something like what you described initially.
Another approach is to de-normalize your data so that you have a single
document for every range. But this might or might not suit your
application. You haven't said anything about the
Yes, I see - I think your best bet is to index every day as a distinct
value. Don't worry about having 100's of values.
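(Sketch, names illustrative: a multi-valued int field avail_day holding one
value per available day, queried with avail_day:[20120301 TO 20120315];
a range query against a multi-valued field matches any apartment with at
least one available day inside the window.)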
-Mike
On 02/27/2012 05:11 PM, federico.wachs wrote:
This is used on an apartment booking system, and what I store as solr
documents can be seen as apartments. These apartmen
I don't know if this would help with OOM conditions, but are you using a
tint type field for this? That should be more efficient to search than
a regular int or string.
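For reference, the example schema defines tint like this; the field
declaration below it is illustrative:
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="avail_day" type="tint" indexed="true" stored="false"
       multiValued="true"/>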
-Mike
On 02/27/2012 05:27 PM, federico.wachs wrote:
Yeah that's what I'm doing right now.
But whenever I try to index an ap
On 3/27/2012 11:14 AM, Mark Miller wrote:
On Mar 27, 2012, at 10:51 AM, Shawn Heisey wrote:
On 3/26/2012 6:43 PM, Mark Miller wrote:
It doesn't get thrown because that logic needs to continue - you don't
necessarily want one bad document to stop all the following documents from
being added.
You can specify a solr field as "multi-valued", and then supply multiple
values for it. What that really does is concatenate all the values with
a positional gap between them to prevent phrases and other positional
queries from traversing the boundary between the distinct values.
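A minimal schema sketch (the gap value is the example schema's usual
default; field names illustrative):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  ...
</fieldType>
<field name="author" type="text" indexed="true" stored="true"
       multiValued="true"/>
With a gap of 100 positions, a phrase query can never match across two
distinct author values.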
-Mike
On 05
I'm creating some Solr plugins that index and search documents in a
special way, and I'd like to make them as easy as possible to
configure. Ideally I'd like users to be able to just drop a jar in
place without having to copy any configuration into schema.xml, although
I suppose they will ha
ok, never mind all is well - I had a mismatch between the
schema-declared field and my programmatic field, where I was overzealous
in using OMIT_TF_POSITIONS.
-Mike
On 6/2/2012 5:02 PM, Mike Sokolov wrote:
I'm creating some Solr plugins that index and search documents in a
special way
());
    setQueryAnalyzer(new WhitespaceGapAnalyzer());
}

protected Field.Index getFieldIndex(SchemaField field, String internalVal) {
    return Field.Index.ANALYZED;
}
}
On 6/2/2012 5:48 PM, Mike Sokolov wrote:
ok, never mind all is well - I had a mismatch
I agree, that seems odd. We routinely index XML using either
HTMLStripCharFilter, or XmlCharFilter (see patch:
https://issues.apache.org/jira/browse/SOLR-2597), both of which parse
the XML, and we don't see such a huge speed difference from indexing
other field types. XmlCharFilter also allo
Does anybody know of a way to detect when the highlight snippet begins
at the beginning of the field or ends at the end of the field using one
of the standard highlighters shipped w/Solr? We'd like to display
ellipses only when there is additional text surrounding the snippet in
the original
Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core
subdirectory.
It seems like the problem arises from using the solrconfig.xml that's
distributed as example/solr/conf/solrconfig.xml
The solrconfig.xml's in
Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens. For example, "PET" might
be a synonym of "positron emission tomography", but "pet" wouldn't be.
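A sketch of an analysis chain ordered so the case-sensitive synonym match
happens before lower-casing (the synonyms.txt entry is illustrative):
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- synonyms.txt: PET, positron emission tomography -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="false" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>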
-Mike
On 04/26/2011 09:51 AM, Robert Muir wrote:
On Tue, Apr 26, 2011 at 12:24
cased... it's just
arbitrary that they are only being analyzed with the "tokenizer".
On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov wrote:
Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens. For example, "
StandardTokenizer will have stripped punctuation, I think. You might try
searching for all the entity names though:
(agrave | egrave | omacron | etc... )
The names are pretty distinctive. Although you might have problems with
greek letters.
-Mike
On 04/28/2011 12:10 PM, Paul wrote:
I'm tr
No clue. Try Wireshark to gather more data?
On 04/28/2011 02:53 PM, Jed Glazner wrote:
Anybody?
On 04/27/2011 01:51 PM, Jed Glazner wrote:
Hello All,
I'm having a very strange problem that I just can't figure out. The
slave is not able to replicate from the master, even though the master
is r
This is in 1.4 - we push updates via SolrJ; our application sees the
updates, but when we use the solr admin screens to run test queries, or
use Luke to view the schema and field values, it sees the database in
its state prior to the commit. I think eventually this seems to
propagate, but I'm
Thanks - we are issuing a commit via SolrJ; I think that's the same
thing, right? Or are you saying really we need to do a separate commit
(via HTTP) to update the admin console's view?
-Mike
On 05/02/2011 11:49 AM, Ahmet Arslan wrote:
This is in 1.4 - we push updates via SolrJ; our applica
Ah - I didn't expect that. Thank you!
On 05/02/2011 12:07 PM, Ahmet Arslan wrote:
Thanks - we are issuing a commit via SolrJ; I think that's the same
thing, right? Or are you saying really we need to do a separate commit
(via HTTP) to update the admin console's view?
Yes separate commit is
I think the key question here is what's the best way to perform indexing
without affecting search performance, or without affecting it much. If
you have a batch of documents to index (say a daily batch that takes an
hour to index and merge), you'd like to do that on an offline system,
and then
Thanks - that sounds like what I was hoping for. So the I/O during
replication will have *some* impact on search performance, but
presumably much less than reindexing and merging/optimizing?
-Mike
Master/slave replication does this out of the box, easily. Just set the slave
to update on Opti
It preserves the location of the terms in the original HTML document so
that you can highlight terms in HTML. This makes it possible (for
instance) to display the entire document, with all the search terms
highlighted, or (with some careful surgery) to display formatted HTML
(bold, italic, etc
Would anyone care to comment on the merits of storing indexed full-text
documents in Solr versus storing them externally?
It seems there are three options for us:
1) store documents both in Solr and externally - this is what we are
doing now, and gives us all sorts of flexibility, but doesn't
On 05/15/2011 11:48 AM, Erick Erickson wrote:
Where are the documents coming from? Because storing them ONLY in
Solr risks losing them if your index is somehow hosed.
In our case, we generally have source documents and can reproduce the
index if need be, but that's a good point.
Storing the
On 05/16/2011 09:24 AM, Dmitry Kan wrote:
Dear list,
Might have missed it from the literature and the list, sorry if so, but:
SOLR 1.4.1
Consider the query:
term1 term2 OR "term1 term2" OR "term1 term3"
I think what's happening is that your query gets rewritten into
something like:
We use log4j explicitly and find it irritating to deal with the built-in
JDK logging default. We also have conflicts with other packages that
have their own ideas about how to bind slf4j, so the less of this the
better, IMO. The 1.6.1 no-op default behavior seems a bit unfortunate
as out-of-t
You might want to create a field that's analyzed using
HtmlStripCharFilter - this will index all the non-tag/non-attribute text
in the document, and if you store the value, will store the entire XML
document as well.
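A minimal fieldType sketch along those lines:
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Declare the field with stored="true" and the original markup is stored
verbatim while only the text content is indexed.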
I've done some work on an XmlStripCharFilter, which does the same thing
(onl
Cool! Suggestion: you might want to replace
externalVal.toLowerCase().split(" ");
with
externalVal.toLowerCase().split("\\s+");
Also, I bet folks might have different ideas about what to do with
hyphens, so maybe:
externalVal.toLowerCase().split("[-\\s]+");
In fact why not make it a config
A possible workaround is to re-fetch the documents in your result set
with a query that is:
+id:(id1 OR id2 OR ... id20) (<original query>)
where id1..20 are the doc ids in your result set.
This would require two round-trips, though.
-Mike
On 05/24/2011 08:19 AM, Koji Sekiguchi wrote:
(11/05/24 20:56), Lord Khan H
The "*" endpoint for range terms wasn't implemented yet in 1.4.1 As a
workaround, we use very large and very small values.
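For example, instead of the then-unsupported open-ended form
timestamp:[* TO NOW], something like
timestamp:[1970-01-01T00:00:00Z TO NOW]
works; the sentinel just has to be smaller (or larger) than any real value
(field name illustrative).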
-Mike
On 05/27/2011 12:55 AM, alucard001 wrote:
Hi all
I am using SOLR 1.4.1 (according to solr info), but no matter what date
field I use (date or tdate) defined in def
I believe there is a query parser that accepts queries formatted in XML,
allowing you to provide a parse tree to Solr; perhaps that would get you
the control you're after.
-Mike
On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote:
Hi,
I want to write my own query expander. It needs to obtain
Wildcard queries aren't analyzed, I think? I'm not completely sure what
the best workaround is here: perhaps simply lowercasing the query terms
yourself in the application. Also - I hope someone more knowledgeable
will say that the new HighlightQuery in trunk doesn't have this
restriction, bu
oops, please s/Highlight/Wildcard/
On 06/14/2011 05:31 PM, Mike Sokolov wrote:
Wildcard queries aren't analyzed, I think? I'm not completely sure
what the best workaround is here: perhaps simply lowercasing the query
terms yourself in the application. Also - I hope so
works but can get complex. The query that
I'm executing may have things like ranges which require some words to
be upper case (i.e. TO). I think this would be much better solved on
Solr's end; is there a JIRA about this?
On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov <soko...
I'd be very interested in this, as well, if you do it before me and are
willing to share...
A related question I have tried to ask on this list, and have never
really gotten a good answer to, is whether it makes sense to just chuck
the external storage and treat the lucene index as the primary
like that.
Thoughts/comments?
On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov <soko...@ifactory.com> wrote:
I'd be very interested in this, as well, if you do it before me
and are willing to share...
A related question I have tried to ask on this list, and ha
e should include
but that's all I would need at present.
On Mon, Jun 20, 2011 at 9:54 AM, Mike Sokolov <soko...@ifactory.com> wrote:
Another option for determining whether to go to external
storage would be to examine the SchemaField, see if it
On 06/22/2011 04:01 AM, Dennis de Boer wrote:
Hi Bill,
as far as I understood now, with the help of my friend, you can't.
Multivalued fields don't work that way.
You can however always filter the facet results manually in the JSP. You
know what the user chose as a facet.
Yes - that is the m
We always remove the facet filter when faceting: in other words, for a
good user experience, you generally want to show facets based on the
query excluding any restriction based on the facets.
So in your example (facet B selected), we would continue to show *all*
facets. Only if you performed a
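(Since Solr 1.4 this can be done per request with tagged filter exclusion;
field names illustrative:
  fq={!tag=f}facet_field:B&facet=true&facet.field={!ex=f}facet_field
The {!ex=f} tells faceting to ignore the filter tagged f, so counts are
computed over the unfiltered query.)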
Actually - you are both wrong!
It is true that 0x is a valid UTF8 character, and not a valid UTF8
byte sequence.
But the parser is reporting (or trying to) that 0x is an invalid XML
character.
And Robert - if the wording offends you, you might want to send a note
to Tatu (http://ji
issues like this is to wrap the stream I'm handing to the parser
in some kind of cleanup stream that handles a few yucky issues. You
could, eg, just strip out invalid XML characters. Maybe Nutch should be
doing this, or at least handling the error better?
-Mike
On 06/27/2011 09:19 AM,
I don't think this is a BOM - that would be 0xfeff. Anyway the problem
we usually see w/processing XML with BOMs is in UTF8 (which really
doesn't need a BOM since it's a byte stream anyway), in which if you
transform the stream (bytes) into a reader (chars) before the xml parser
can see it, th
Markus - if you want to make sure not to offend XML parsers, you should
strip all characters not in this list:
http://en.wikipedia.org/wiki/XML#Valid_characters
You'll see that article talks about XML 1.1, which accepts a wider range
of characters than XML 1.0, and I believe the Woodstox parse
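A minimal sketch of that kind of cleanup for the XML 1.0 ranges (my own
untested code, not from the thread):

public static String stripInvalidXmlChars(String in) {
    StringBuilder sb = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); ) {
        int c = in.codePointAt(i);
        i += Character.charCount(c);
        // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF]
        //          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        boolean valid = c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
        if (valid) {
            sb.appendCodePoint(c);
        }
    }
    return sb.toString();
}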
Does the phonetic analysis preserve the offsets of the original text field?
If so, you should probably be able to hack up FastVectorHighlighter to
do what you want.
-Mike
On 06/29/2011 02:22 PM, Jamie Johnson wrote:
I have a schema with a text field and a text_phonetic field and would like
t
t. Having no
familiarity with FastVectorHighlighter, is there somewhere specific I
should be looking?
On Wed, Jun 29, 2011 at 3:20 PM, Mike Sokolov wrote:
Does the phonetic analysis preserve the offsets of the original text field?
If so, you should probably be able to hack up FastVectorHigh
I'm not familiar with the CharFilters, I'll look into those now.
Is the solr.LowerCaseFilterFactory not handling wildcards the expected
result or is this a bug?
On Wed, Jun 15, 2011 at 4:34 PM, Mike Sokolov wrote:
I wonder whether CharFilters are applied to wildcard terms? I suspect
lower casing the works but can get complex. The query that I'm
executing may have things like ranges which require some words to be upper
case (i.e. TO). I think this would be much better solved on Solr's end; is
there a JIRA about this?
On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov wrote:
Yes, that's right. But at the moment the HL code basically has to
reconstruct and re-run your query - it doesn't have any special
knowledge. There's some work going on to try and fix that, but it seems
like it's going to require some fairly major deep re-plumbing.
-Mike
On 07/01/2011 07:54
Did you ever commit?
On 07/07/2011 01:58 PM, Gabriele Kahlout wrote:
so, how about this:
Document doc = searcher.doc(i);  // get the doc
doc.removeField("wc");           // remove the field in case there's
addWc(doc, docLength);           // add the new field
writer.updateDocumen
There is a syntax that allows you to specify different analyzers to use
for indexing and querying, in schema.xml. But if you don't do that, it
should use the same analyzer in both cases.
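The syntax in question, inside a fieldType in schema.xml:
<analyzer type="index"> ... </analyzer>
<analyzer type="query"> ... </analyzer>
A single <analyzer> element with no type attribute is used for both.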
-Mike
On 07/11/2011 10:58 AM, Gabriele Kahlout wrote:
With a lucene QueryParser instance it's possible to
I think you need to list the charfilter earlier in the analysis chain;
before the tokenizer. Probably Solr should tell you this...
-Mike
On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
sounds logical. I just changed it to the following, restarted and reindexed
with commit:
Hmm - I'm not sure about that; see
https://issues.apache.org/jira/browse/SOLR-2119
On 07/25/2011 12:01 PM, Markus Jelsma wrote:
charFilters are executed first regardless of their position in the analyzer.
On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
I think you need to lis
keyword      false  false
type         word   word
startOffset  6      10
endOffset    9      13
On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
Hmm - I'm not sure about that; see
https://issues.apache.org/jira/browse/SOLR-2119
On 07/25/2011 12:01 PM, Markus Jelsma wrote:
I'm not sure I would identify stemming as the culprit here.
Do you have very large documents? If so, there is a patch for FVH
committed to limit the number of phrases it looks at; see
hl.phraseLimit, but this won't be available until 3.4 is released.
You can also limit the amount of each doc
A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
different set of document versions, and each user should see only the
most recent v
Regards,
Tomás
On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov wrote:
A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
diffe
local test dataset (~30M docs, ~8000 groups) and my
machine. You might encounter different search times when setting
group.ngroups=true.
Martijn
2011/8/1 Mike Sokolov
Thanks, Tomas. Yes we are planning to keep a "current" flag in the most
current document. But there are cases whe
If you want to avoid re-indexing, you could consider building a synonym
file that is generated using your rule set, and then using that to
expand your queries. You'd need to get a list of all terms in your
index and then process them to generate synonyms. Actually, I don't
know how to get a
You have a few choices:
1) flatten your field structure - like your "undesirable" example, but
wouldn't you want to have the document identifier as a field value also?
2) use phrase queries to make sure the key/value pairs are adjacent
3) use a join query (see the sketch below)
That's all I can think of
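For option 3, the join query parser (Solr trunk/4.x) would look roughly
like this, with illustrative field names:
  q={!join from=parent_id to=id}key:value
i.e., run the inner query, collect parent_id from the matches, and return
the documents whose id has one of those values.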
-Mike
On
Although you weren't very clear about it, it sounds as if you want the
results to be sorted by a name that actually matched the query? In
general that is not going to be easy, since it is not something that can
be computed in advance and thus indexed.
-Mike
On 08/03/2011 10:39 AM, Olson, Ro
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't
use a true XML parser?
You might want to try passing your documents through "xmllint -noent"
(basically parse and reserialize) - that should inline the characters as
UTF-8?
On 07/09/2012 03:18 PM, Michael Belenki wrote:
I think the issue here is that DIH uses Woodstox "BasicStreamReader"
(see
http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html)
which has only minimal DTD support. It might be best to use
ValidatingStreamReader
(http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/w