Solr tf-idf

2011-12-06 Thread Nejla Karacan
Hello,

I need the tf-idf values for a set of texts, and I am now using Apache Solr.
I am a novice and have some problems.

My question is: how can I extract the tf-idf values?


There are many files in the folder apache-solr-3.5.0\example\solr\data\index,

but I can't use them.

Is the output only available as an XML file?

Please help me.

Regards
Nejla Karacan


-- 
Best regards
Nejla



Solr and TF-IDF

2012-01-26 Thread Nejla Karacan
Hey there,

I'm using Solr for my thesis, where I have to implement a content-based
recommender system for movies.

I have indexed about 20,000 movies with the following information:
movie-id
title
genre
plot/movie-description <- !!!
cast

I've enabled the TermVectorComponent for the fields genre, description and
cast,
so I can get the tf-idf values for the terms of every movie.

With these term/tf-idf pairs I have to compute the similarities
between movies using cosine similarity.
I know about the Solr feature MLT (MoreLikeThis), but that's not the
solution: I have to
implement the cosine similarity in Java myself.
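
As a minimal sketch of the cosine similarity described above, assuming the terms and tf-idf values of each movie are already held in a Map<String, Double> (the class name CosineSimilarity and the method cosine are hypothetical, chosen only for illustration):

```java
import java.util.Map;

final class CosineSimilarity {
    private CosineSimilarity() {}

    /**
     * Cosine similarity of two sparse vectors given as term -> tf-idf maps.
     * Returns 0.0 if either vector is empty or has zero length.
     */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map; only shared terms contribute to the dot product.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;

        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }

        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Calling CosineSimilarity.cosine(movieA, movieB) with two such maps gives a value between 0 and 1 for non-negative tf-idf weights; movies without any shared term score 0.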

Now I have some problems/questions:
I get the responses in XML format, which I read with an XML reader in
Java
that walks through every child node in order to reach the right one.
Is there a better way to get these values from the node attributes or node texts?
I have tried wt=csv, but for those requests I get
responses containing only the movie IDs, nothing more.
With the XML response writer, my request looks like this, for example:
http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true
I get the right response with all terms and their tf-idf values, in XML.

And if I add wt=csv to the request
http://localhost:8983/solr/select/?qt=tvrh&q=id:1800180382&fl=id&tv.tf_idf=true&wt=csv
I get only this:
id
1800180382

Maybe my request is wrong?
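
On the XML side, one option that avoids walking every child node by hand is an XPath query. The sketch below assumes the term vector response nests as termVectors > document > field > term, with each term carrying a double element named tf-idf. The sample XML, the class TermVectorXml and the method tfIdf are made up for illustration and should be checked against the real response.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TermVectorXml {
    private TermVectorXml() {}

    // Hypothetical, trimmed-down response; the real layout is an assumption to verify.
    static final String SAMPLE =
          "<response>"
        + "<lst name='termVectors'>"
        + "<lst name='doc-0'>"
        + "<str name='uniqueKey'>1800180382</str>"
        + "<lst name='description'>"
        + "<lst name='alien'><double name='tf-idf'>0.25</double></lst>"
        + "<lst name='ship'><double name='tf-idf'>0.125</double></lst>"
        + "</lst>"
        + "</lst>"
        + "</lst>"
        + "</response>";

    /**
     * Collects term -> tf-idf for one field.
     * The field name is inserted into the XPath as-is, so it must not contain a quote.
     */
    static Map<String, Double> tfIdf(String xml, String field) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every term node below the given field in one step.
            String expr = "//lst[@name='termVectors']/lst/lst[@name='" + field + "']/lst";
            NodeList terms = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);

            Map<String, Double> result = new LinkedHashMap<String, Double>();
            for (int i = 0; i < terms.getLength(); i++) {
                Element term = (Element) terms.item(i);
                String value = xpath.evaluate("double[@name='tf-idf']", term);
                if (!value.isEmpty()) {
                    result.put(term.getAttribute("name"), Double.parseDouble(value));
                }
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on response", e);
        }
    }
}
```

TermVectorXml.tfIdf(TermVectorXml.SAMPLE, "description") returns a map with the two sample terms; with a real response, the XML string would come from the HTTP request shown above.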

Another problem: once I get the terms and their tf-idf values, I store
them in a map.
But the values are in no particular order. I want, for example, to store only the
top 10 terms,
i.e. the 10 terms with the highest tf-idf values. Can I have them sorted in
descending order?
I haven't found anything for that. If it's not possible, I will have to sort them
afterwards in the map.
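
Sorting afterwards on the Java side could look roughly like this. It is only a sketch, assuming Java 8 or later and a Map<String, Double> of term to tf-idf value; the class TopTerms and the method topN are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TopTerms {
    private TopTerms() {}

    /** Returns the n entries with the highest values, in descending order. */
    static Map<String, Double> topN(Map<String, Double> tfIdf, int n) {
        // Copy the entries so the original map is left untouched.
        List<Map.Entry<String, Double>> entries =
                new ArrayList<Map.Entry<String, Double>>(tfIdf.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        int limit = Math.max(0, Math.min(n, entries.size()));
        // LinkedHashMap keeps insertion order, so the descending order survives.
        Map<String, Double> top = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Double> e : entries.subList(0, limit)) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

TopTerms.topN(terms, 10) would then keep only the 10 highest-weighted terms of a movie.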

My last question:
every movie has a genre, often more than one.
It's like the "cat" (category) field in the exampledocs with ipod/monitor
etc., and it's an important factor for the movies.
How can I integrate this factor?
I changed the boost attribute in the Solr XML schema like this:

Is that enough, or is there another possibility?
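
One conceivable possibility on the Java side, stated purely as an assumption and not as something Solr does, is to compute one similarity per field (genre, description, cast) and combine them with a weighted average, giving genre a larger weight. The class FieldWeighting, the method combined and the example weights are hypothetical.

```java
final class FieldWeighting {
    private FieldWeighting() {}

    /**
     * Weighted average of per-field similarities.
     * sims[i] is the similarity for field i, weights[i] its non-negative weight.
     */
    static double combined(double[] sims, double[] weights) {
        if (sims.length != weights.length) {
            throw new IllegalArgumentException("sims and weights must have the same length");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < sims.length; i++) {
            weightedSum += weights[i] * sims[i];
            totalWeight += weights[i];
        }
        // Avoid dividing by zero when all weights are zero.
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }
}
```

For example, FieldWeighting.combined(new double[] {1.0, 0.5, 0.0}, new double[] {2.0, 1.0, 1.0}) counts the first field (say genre) twice as much as each of the other two.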

As you can probably see, I am a beginner with Solr.
When I started a few weeks ago it was even more difficult for me, but now
it is going better.
I would be very grateful for any help, ideas, tips or suggestions!

Many regards
Nejla