I am delighted to announce that I have it all working again! Well, not all, just the searching!

I deleted my core and created a new one from the command line (solr create_core -c EventLog2) using the basic_configs option. Then I had to add my columns to schema.xml, add the dataimport handler to solrconfig.xml, and tweak a couple of other details. But to make a long story short, parsing is working and I can search on terms without wrapping asterisks!! Yay! Thanks for the help!
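In case it helps anyone else hitting the same wall, the additions were roughly along these lines (from memory, so treat it as a sketch; the field types may not be exactly what I ended up with, and the /dataimport handler registration is just the standard snippet, assuming the dataimport jar is on the classpath):

In schema.xml:
    <field name="id"       type="string"       indexed="true" stored="true" required="true"/>
    <field name="username" type="string"       indexed="true" stored="true"/>
    <field name="logtext"  type="text_general" indexed="true" stored="true"/>
    <field name="category" type="int"          indexed="true" stored="true"/>

In solrconfig.xml:
    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>
      </lst>
    </requestHandler>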

Spell-checking still isn't working, though, and I'm not eager to tackle it today, but I will eventually. The complaint is that it can't find ELspell, a spellchecker I had defined in the old setup that I blew away, so I'll have to redefine it at some point! For now, I'm just gonna delight in having searching working again!
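When I do get around to it, I expect the fix is to re-add the spellchecker definition to the SpellCheckComponent in solrconfig.xml, something along these lines (a sketch only -- whether ELspell was index-based or file-based in the old setup, and which field it read from, I'll have to reconstruct):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <!-- ELspell is my old spellchecker's name; field and index dir are guesses -->
        <str name="name">ELspell</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="field">logtext</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
      </lst>
    </searchComponent>

and then point the request handler's spellcheck.dictionary parameter at ELspell.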

Mark

On 9/26/2015 11:05 PM, Erick Erickson wrote:
No need to re-install Solr; just create a new core. This time it'd probably be
easiest to use the bin/solr create_core command. In the Solr
directory, just type bin/solr create_core -help to see the options.
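For example, something like (core name here is just an example; -d picks
which configset to copy):

    bin/solr create_core -c mycore -d basic_configs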

We're pretty much trying to migrate to using bin/solr for all the maintenance
we can, but as always the documentation lags the code.

Yeah, things are a bit ragged. The admin UI/core UI is really a legacy
bit of code that has _always_ been confusing; I'm hoping we can pretty
much remove it at some point, trappy as it is.

Best,
Erick

On Sat, Sep 26, 2015 at 12:49 PM, Mark Fenbers <mark.fenb...@noaa.gov> wrote:
OK, a lot of dialog while I was gone for two days!  I read the whole thread,
but I'm a newbie to Solr, so some of the dialog was Greek to me.  I
understand the words, of course, but applying it so I know exactly what to
do without screwing something else up is the problem.  After all, that is
how I got into the mess in the first place.  I'm glad I have good help to
untangle the knots I've made!

I'd like to start over (option 1 below), but does this mean deleting all my
config and reinstalling Solr?  Maybe that's not a bad idea, but I will at
least save off my data-config.xml, as that is clearly the one thing that is
probably working right.  However, I did do quite a bit of editing that I
would have to do again.  Please advise...

To be fair, I must answer Erick's question of how I created the data index
in the first place, because this might be relevant...

The bulk of the data is read from 9000+ text files, where each file was
manually typed.  Before inserting into the database, I do a little bit of
processing of the text using "sed" to delete the top few and bottom few
lines, and to substitute each single-quote character with a pair of
single-quotes (so PostgreSQL doesn't choke).  Line-feed characters are
preserved as ASCII 10 (hex 0A), but there shouldn't be (and I am not aware
of) any characters aside from what is on the keyboard.
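The preprocessing amounts to a small shell pipeline; roughly (a sketch,
assuming GNU tail/head/sed, and taking "few" to mean two lines at each end):

    # drop the first 2 and last 2 lines, then double up single quotes for the SQL insert
    entryText=$(tail -n +3 "$file" | head -n -2 | sed "s/'/''/g")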

Next, I insert it with this command:
psql -U awips -d OHRFC -c "INSERT INTO EventLogText VALUES('$postDate',
'$user', '$postDate', '$entryText', '$postCatVal');"

In case you are wondering about my table, it is defined in this way:
CREATE TABLE eventlogtext (
   posttime timestamp without time zone NOT NULL, -- Timestamp of this
entry's original posting
   username character varying(8), -- username (logname) of the original
poster
   lastmodtime timestamp without time zone, -- Last time record was altered
   logtext text, -- text of the log entry
   category integer, -- bit-wise category value
   CONSTRAINT eventlogtext_pkey PRIMARY KEY (posttime)
)

To do the indexing, I merely use /dataimport?command=full-import; it knows what
to do from my data-config.xml, which is here:

<dataConfig>
    <dataSource driver="org.postgresql.Driver"
                url="jdbc:postgresql://dx1f/OHRFC" user="awips" />
    <document>
        <entity name="eventlogtext"
                query="SELECT posttime AS id, username, logtext, category FROM eventlogtext;"
                deltaQuery="SELECT posttime AS id FROM eventlogtext WHERE lastmodtime > '${dataimporter.last_index_time}';">
            <entity name="categorytypes"
                    query="SELECT catname FROM categorytypes WHERE catid='${eventlogtext.category}';">
            </entity>
        </entity>
    </document>
</dataConfig>
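(For completeness: kicking off the full import is just a request to that
handler, e.g. the following -- core name is just an example; clean=true,
the default for a full import, wipes the index first:)

    curl 'http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=true'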

Hope this helps!

Thanks,
Mark

On 9/24/2015 10:57 AM, Erick Erickson wrote:
Geraint:

Good catch! I totally missed that. So all of our focus on schema.xml has
been... totally irrelevant. Now that you've pointed that out, there's also
the addition of add-unknown-fields-to-the-schema, which indicates you started
this up in "schemaless" mode.

In short, Solr is trying to guess what your field types should be and
guessing wrong (again and again and again). This is the classic weakness of
schemaless: it's great for indexing stuff fast, but if it guesses wrong
you're stuck.


So to the original problem: I'd start over and either
1> use the regular setup, not schemaless
or
2> use the _managed_ schema API to explicitly add fields and fieldTypes to
the managed schema
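For option 2>, adding a field is a single POST to the Schema API, something
like this (field name, type, and core name are just illustrative):

    curl -X POST -H 'Content-type:application/json' \
      'http://localhost:8983/solr/mycore/schema' -d '{
        "add-field": {
          "name": "logtext",
          "type": "text_general",
          "indexed": true,
          "stored": true
        }
      }'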

Best,
Erick

On Thu, Sep 24, 2015 at 2:02 AM, Duck Geraint (ext) GBJH <
geraint.d...@syngenta.com> wrote:

Okay, so maybe I'm missing something here (I'm still relatively new to
Solr myself), but am I right in thinking the following is still in your
solrconfig.xml file:

    <schemaFactory class="ManagedIndexSchemaFactory">
      <bool name="mutable">true</bool>
      <str name="managedSchemaResourceName">managed-schema</str>
    </schemaFactory>

If so, wouldn't using a managed schema make several of your field
definitions inside the schema.xml file semi-redundant?

Regards,
Geraint


Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com


-----Original Message-----
From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
Sent: 24 September 2015 09:23
To: solr-user@lucene.apache.org
Subject: Re: query parsing

I would focus on this:

"

5> now kick off the DIH job and look again.

Now it shows a histogram, but most of the "terms" are long -- the full
texts of (the table.column) eventlogtext.logtext, including the whitespace
(with %0A used for newline characters)...  So, it appears it is not being
tokenized properly, correct?"
Can you open the schema.xml from your Solr UI and show us the snippet
for the field that seems not to tokenise?
Can you show us the related schema browser page (even a screenshot is fine)?
Could it be a problem of encoding?
Following Erick's details about the analysis, what are your results?

Cheers

2015-09-24 8:04 GMT+01:00 Upayavira <u...@odoko.co.uk>:

Typically, the index dir is inside the data dir. Delete the index dir
and you should be good. If there is a tlog next to it, you might want
to delete that also.

If you don't have a data dir, I wonder whether you set the data dir
when creating your core or collection. Typically the instance dir and
data dir aren't needed.
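For reference, a default (non-cloud) Solr 5 core usually lives under
server/solr/<corename>/, with conf/, core.properties, and data/ inside it,
and index/ plus tlog/ under data/. So, with Solr stopped, the wipe would
look something like:

    # "mycore" stands in for your actual core name
    rm -rf server/solr/mycore/data/index server/solr/mycore/data/tlog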

Upayavira

On Wed, Sep 23, 2015, at 10:46 PM, Erick Erickson wrote:
OK, this is bizarre. You'd have had to set up SolrCloud by
specifying -zkRun or -zkHost when you start Solr; highly unlikely. On
the admin page there would be a "cloud" link on the left side, and I
really doubt one's there.

You should have a data directory; it should be the parent of the
index and tlog directories. As a sanity check, try looking at the
analysis page. Type a bunch of words in the left-hand-side indexing
box and uncheck the verbose box. As you can tell, I'm grasping at
straws. I'm still puzzled why you don't have a "data" directory here,
but that shouldn't really matter. How did you create this index? I
don't mean the data import handler; rather, how did you create the
core that you're indexing to?

Best,
Erick

On Wed, Sep 23, 2015 at 10:16 AM, Mark Fenbers <mark.fenb...@noaa.gov> wrote:

On 9/23/2015 12:30 PM, Erick Erickson wrote:

Then my next guess is you're not pointing at the index you think you
are when you 'rm -rf data'.

Just ignore the Elall field for now, I should think, although get rid
of it if you don't think you need it.

DIH should be irrelevant here.

So let's back up.
1> go ahead and "rm -fr data" (with Solr stopped).

I have no "data" dir.  Did you mean "index" dir?  I removed 3
index directories (2 for spelling):
cd /localapps/dev/eventLog; rm -rfv index solr/spFile solr/spIndex

2> start Solr
3> do NOT re-index.
4> look at your index via the schema-browser. Of course there should
be nothing there!

Correct!  It said "there is no term info :("

5> now kick off the DIH job and look again.

Now it shows a histogram, but most of the "terms" are long -- the
full texts of (the table.column) eventlogtext.logtext, including the
whitespace (with %0A used for newline characters)...  So, it appears
it is not being tokenized properly, correct?

Your logtext field should have only single tokens. The fact that you
have some very long tokens (presumably with whitespace) indicates that
you aren't really blowing the index away between indexing runs.

Well, I did this time for sure.  I verified that initially,
because it showed there was no term info until I DIH'd again.

Are you perhaps in SolrCloud with more than one replica?

Not that I know of, but being new to Solr, there could be things going
on that I'm not aware of.  How can I tell?  I certainly didn't set
anything up for SolrCloud deliberately.

In that case you might be getting the index replicated on startup,
assuming you didn't blow away all replicas. If you are in SolrCloud,
I'd just delete the collection and start over, after ensuring that
you'd pushed the configset up to ZooKeeper.

BTW, I always look at the schema.xml file from the Solr admin window
just as a sanity check in these situations.

Good idea!  But the one shown in the browser is identical to the one
I've been editing!  So that's not an issue.



--
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
