Re: Records skipped when using DataImportHandler

Erick Erickson Mon, 08 Aug 2011 16:41:30 -0700

Spend some time on the admin/analysis page; it'll show you what each
part of the analysis chain is doing to your data. It'll save you a world
of headache...


But at a guess WordDelimiterFilterFactory is your culprit...
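
If the goal is to index each zip code as one untokenized term, a minimal
sketch of the schema.xml change might look like the following. It assumes a
"string" fieldType backed by solr.StrField, as in the stock example schema;
check the fieldType definitions in your own schema.xml, since a type that is
merely named "string" but carries an analyzer will still tokenize.

```xml
<!-- Assumption: "string" maps to solr.StrField, as in the stock example schema -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- StrField is not analyzed, so 91324 is indexed as the single term "91324" -->
<field name="zipCode" type="string" indexed="true" stored="true"/>
```

After changing a field's type you need to restart Solr (or reload the core)
and re-run the full import; documents indexed under the old type keep their
old tokens until they are reindexed.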

Best
Erick

On Thu, Aug 4, 2011 at 6:08 PM, anand sridhar <anand.for...@gmail.com> wrote:
> OK. After some analysis, I traced the reduced result set to the fact that
> the zipcode field is not indexed 'as is', i.e. the zipcode field values
> are broken down into tokens and then stored. Hence, if there are 10
> documents with zipcode values ranging from 91000 to 91009, they are not
> stored as 91000, 91001, etc. Instead, the most common recurrences are
> grouped together and stored as tokens, resulting in a reduced result
> set.
> The net effect is that I cannot search for a value like 91000, since it's
> not stored as is.
>
> I suspect this has something to do with the field type the zipcode is
> associated with. Right now, zipcode is a field of type text_general, where
> the StandardTokenizerFactory may be breaking the values into tokens.
> However, I want to store them without tokenizing. What's the best field
> type for this?
>
> I already explored the String fieldtype which is supposed to store the
> values as is, but I see that the values are still being tokenized.
>
>
> Thanks,
> Anand
> On Wed, Aug 3, 2011 at 7:24 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Sorry, I'm on a restricted machine so can't get the precise URL. But,
>> there's a debug page for DIH that might allow you to see what the query
>> actually returns. I'd guess one of two things:
>> 1> you aren't getting the number of rows you think.
>> 2> you aren't committing the documents you add.
>>
>> But that's just a guess.
>>
>> Best
>> Erick
>> On Aug 3, 2011 2:15 PM, "anand sridhar" <anand.for...@gmail.com> wrote:
>> > Hi,
>> > I am a newbie to Solr and have been trying to learn using
>> > DataImportHandler.
>> > I have a query in data-config.xml that fetches about 5 records when I
>> > run it in SQL Query manager.
>> > However, when Solr does a full import, it skips 4 records and imports
>> > only 1.
>> > What could be the reason for that?
>> >
>> > My data-config.xml looks like this -
>> >
>> > <dataConfig>
>> > <dataSource type="JdbcDataSource"
>> > name="GeoService"
>> > driver="net.sourceforge.jtds.jdbc.Driver"
>> > url="jdbc:jtds:sqlserver://10.168.50.104/ZipCodeLookup"
>> > user="sa"
>> > password="psiuser"/>
>> > <document>
>> > <entity name="city"
>> > query="select ll.cityId as id, ll.zip as zipCode, c.cityName as
>> > cityName, st.stateName as state, ct.countryName as country from
>> > latlonginfo ll, city c, state st, country ct where ll.cityId = c.cityID and
>> > c.stateID=st.stateID and st.countryID = ct.countryID
>> > order by ll.areacode"
>> > dataSource="GeoService">
>> > <field column="zipCode" name="zipCode"/>
>> > <field column="cityName" name="cityName"/>
>> > <field column="state" name="state"/>
>> > <field column="country" name="country"/>
>> > </entity>
>> > </document>
>> > </dataConfig>
>> >
>> > My fields definition in schema.xml looks as below -
>> >
>> > <field name="CityName" type="text_general" indexed="true" stored="true" />
>> > <field name="zipCode" type="text_general" indexed="true" stored="true"/>
>> > <field name="state" type="text_general" indexed="true" stored="true" />
>> > <field name="country" type="text_general" indexed="true" stored="true" />
>> >
>> > One observation I made is that the 1 record being indexed is the last
>> > record in the result set. I have verified that no duplicate
>> > records are being retrieved.
>> >
>> > For example, if the result set from the database is -
>> >
>> > zipcode  CityName    state  country
>> > -------  ----------  -----  -------
>> > 91324    Northridge  CA     USA
>> > 91325    Northridge  CA     USA
>> > 91327    Northridge  CA     USA
>> > 91328    Northridge  CA     USA
>> > 91329    Northridge  CA     USA
>> > 91330    Northridge  CA     USA
>> >
>> > The record being indexed is the last record all the time.
>> >
>> > Any suggestions are welcome.
>> >
>> > Thanks,
>> > Anand
>>
>
