Thanks everyone, it works! I have successfully indexed them. Thanks again!

I have a couple more questions regarding Solr, if you don't mind.

1-) As I said before, the text files are quite large, between
100 KB and 10 MB, but I need to store them as well for highlighting,
along with their title, description, and tags (I concatenate the tags
while fetching from the DB and treat them as a single value). For the
search results page, I also have to get these columns:

username  (string)
lang (string)
cat (string)
view_count (int)
imgid (int)
thumbs_up (int)
thumbs_down (int)

These columns are not used for searching, just for display. Do you
think it is a better idea to store these columns in Solr as well and
skip querying the database? Or I can just get the IDs from Solr and
query the database myself. Which approach is better from a memory
usage and performance perspective? I was using Sphinx for full-text
search on my production websites, so I am not used to this model, as
Sphinx only returns document IDs.
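
If it matters, this is roughly what I have in mind for schema.xml
(assuming the stock "string" and "int" field types; the names are the
same columns as above):

<field name="username"    type="string" indexed="false" stored="true"/>
<field name="lang"        type="string" indexed="false" stored="true"/>
<field name="cat"         type="string" indexed="false" stored="true"/>
<field name="view_count"  type="int"    indexed="false" stored="true"/>
<field name="imgid"       type="int"    indexed="false" stored="true"/>
<field name="thumbs_up"   type="int"    indexed="false" stored="true"/>
<field name="thumbs_down" type="int"    indexed="false" stored="true"/>

(I suppose cat and view_count would need indexed="true" after all if I
end up filtering and sorting on them for question 2 below.)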

2-) I was using Sphinx for other purposes as well, such as the
"browse" section of the website (like http://www.youtube.com/videos).
It gives better performance on large datasets (sorting, ordering,
etc.). I know some people also use Solr (Lucene) for this, but I have
not seen any website that uses Solr for its "browse" section without
also using facets. So even if I don't use facets, is Solr still
useful for that section? I will be storing a large amount of data in
Solr and expect to have 1 TB of data after 6-8 months.
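
What I mean by "browse" is essentially plain filter + sort queries
like this one (the category value is made up):

http://localhost:8983/solr/select?q=*:*&fq=cat:music&sort=view_count+desc&start=0&rows=20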

3-) I will also be using the
http://wiki.apache.org/solr/MoreLikeThis feature. As I said, the text
files are large. Do you have any suggestions regarding this feature?
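
For reference, I was planning to call it through the standard
handler, something like this (assuming the file text ends up in a
field named "body"):

http://localhost:8983/solr/select?q=id:1234&mlt=true&mlt.fl=body&mlt.count=5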

Thanks again,

On Sun, Apr 18, 2010 at 7:53 AM, Lance Norskog <goks...@gmail.com> wrote:
> Man you people are fast!
>
> There is a bug in Solr/Lucene: it keeps memory around from previous
> fields, so giant text files can run out of memory when they should
> not. This bug is fixed in trunk.
>
> On 4/17/10, Lance Norskog <goks...@gmail.com> wrote:
>> The DataImportHandler lets you fetch the file name from the
>> database record, then load the file as a field and process the
>> text with Tika.
>>
>> It will not be easy :) but it is possible.
>>
>> http://wiki.apache.org/solr/DataImportHandler
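>>
>> In case it helps, a rough sketch of what the data-config.xml could
>> look like (untested; the table name, base path, and Solr field
>> names are placeholders):
>>
>> <dataConfig>
>>   <dataSource name="db" driver="com.mysql.jdbc.Driver"
>>               url="jdbc:mysql://localhost/mydb"
>>               user="..." password="..."/>
>>   <dataSource name="files" type="BinFileDataSource"/>
>>   <document>
>>     <entity name="doc" dataSource="db"
>>             query="SELECT id, title, description, file_path FROM docs">
>>       <!-- id, title, description map onto Solr fields of the same name -->
>>       <entity name="text" dataSource="files" format="text"
>>               processor="TikaEntityProcessor"
>>               url="/path/to/file/${doc.file_path}">
>>         <field column="text" name="body"/>
>>       </entity>
>>     </entity>
>>   </document>
>> </dataConfig>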
>>
>> On 4/17/10, Serdar Sahin <anlamar...@gmail.com> wrote:
>>> Hi,
>>>
>>> I am rather new to Solr and have a question.
>>>
>>> We have around 200,000 txt files stored in our file cloud.
>>> The file paths look something like this:
>>>
>>> file/97/8f/840/fa4-1.txt
>>> file/a6/9d/ab0/ca2-2.txt etc.
>>>
>>> and we also store the metadata about these files (title,
>>> description, tags, etc.) in a MySQL server. What I want to do is
>>> index the title, description, tags, and other data from MySQL,
>>> fetch the corresponding txt file from the file server, and link
>>> them as one record for searching, but I could not figure out how
>>> to automate this process. I can expose the path in the SQL query
>>> (SELECT id, title, description, file_path), and then Solr could
>>> use this path to retrieve the txt file, but I don't know whether
>>> that is possible.
>>>
>>> What is the best way to index these files together with their
>>> title, description, and tags, without coding in Java (Perl is
>>> OK)? These txt files are large, between 100 KB and 10 MB, so
>>> storing them in the database is a last resort.
>>>
>>> Thanks,
>>>
>>> Serdar
>>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
