try something like this:
q.alt=*:*&fq=keyphrase:hotel
though if you dont need to query across multiple fields, dismax is
probably not the best choice
On Tue, Jul 20, 2010 at 4:57 AM, olivier sallou
wrote:
> q will search in defaultSearchField if no field name is set, but you can
> specify in you
there is a first pass query to retrieve all matching document ids from
every shard along with relevant sorting information; the document ids
are then sorted and limited to the amount needed, then a second query
is sent for the rest of the documents' metadata.
On Sun, Jun 27, 2010 at 7:32 PM, Babak
splitOnCaseChange is creating multiple tokens from 3dsMax, disable it
or enable catenateAll, use the analysis page in the admin tool to see
exactly how your text will be indexed by analyzers without having to
reindex your documents, once you have it right you can do a full
reindex.
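a sketch of such a field type (attribute names per WordDelimiterFilterFactory; the type name is made up and untested):

```xml
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateAll=1 also emits the rejoined token, so "3dsMax" matches whole -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="0" catenateAll="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```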
On Mon, Jun 28,
the general consensus among people who run into the problem you have
is to use a plurals only stemmer, a synonyms file or a combination of
both (for irregular nouns etc)
if you search the archives you can find info on a plurals stemmer
On Mon, Jun 28, 2010 at 6:49 AM, wrote:
> Thanks for the ti
you want a combination of WhitespaceTokenizer and EdgeNGramFilter
http://lucene.apache.org/solr/api/org/apache/solr/analysis/WhitespaceTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/EdgeNGramFilterFactory.html
the first will create tokens for each word the second
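a possible field type combining the two (sketch only, gram sizes are illustrative; the grams are usually applied at index time only):

```xml
<fieldType name="text_prefix" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits prefixes of each word: "hotel" -> h, ho, hot, ... -->
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="1" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```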
set generateWordParts=1 on wordDelimiter or use
PatternTokenizerFactory to split on commas
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory
you can use the analysis page to see what your filter chains are going
to do before you index
/admin/analysis.jsp
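for the comma case something like this should work (sketch, the pattern is illustrative):

```xml
<fieldType name="text_commas" class="solr.TextField">
  <analyzer>
    <!-- split the incoming value on commas plus surrounding whitespace -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```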
yes, you can use distributed search across shards with different
schemas as long as the query only references overlapping fields, i
usually test adding new fields or tokenizers on one shard and deploy
only after ive verified its working properly
On Thu, Jun 17, 2010 at 1:10 PM, Markus Jelsma wrote:
see yonik's post on nested queries
http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
so for example i thought you could possibly do a dismax query across
the main fields (in this case just title) and OR that with
_query_:"{!dismax qf=description qs=4}'oil spill'"
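spelled out with the local-params syntax from that post, the whole request might look like this (untested, field names illustrative):

```
q=title:(oil spill) OR _query_:"{!dismax qf=description qs=4}\"oil spill\""
```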
On Thu, Jun 17, 2010 at
use a copyField and index the copy as type string, exact matches on
that field should then work as the text wont be tokenized
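a sketch of the schema side (field and type names are illustrative):

```xml
<!-- exact, untokenized copy of the brand field for filtering -->
<field name="brand_exact" type="string" indexed="true" stored="false"/>
<copyField source="brand" dest="brand_exact"/>
```

then filter with fq=brand_exact:Apple instead of the tokenized field.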
On Thu, Jun 17, 2010 at 3:13 PM, Pete Chudykowski
wrote:
> Hi,
>
> I'm trying with no luck to filter on the exact-match value of a field.
> Speciffically:
> fq=brand:appl
the qs parameter affects matching, but you have to wrap your query in
double quotes, ex:
q="oil spill"&qf=title description&qs=4&defType=dismax
im not sure how to formulate such a query to apply that rule just to
description, maybe with nested queries ...
On Thu, Jun 17, 2010 at 12:01 PM, Blargy
youve created an infinite loop, the shard you query calls all other
shards and itself and so on, create a separate requestHandler and
query that, ex
localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/
dont know if its the best solution but i have a field i facet on
called type, its either 0 or 1, combined with collapse.facet=before i just
sum all the values of the facet field to get the total number found
if you dont have such a field you can always add a field with a single value
--joe
On Wed, May
hello *, im having issues with the synonym filter altering token offsets,
my input text is
"saturday night live"
it is tokenized by the whitespace tokenizer yielding 3 tokens
[saturday, 0, 8], [night, 9, 14], [live, 15, 19]
on indexing these are passed through a synonym filter that has this line
s
ing the RemoveDuplicatesTokenFilter(Factory) do the trick here?
>
> Erik
>
> On Apr 2, 2010, at 4:13 PM, Joe Calderon wrote:
>
>> hello *, i have a field that is indexing the string "the
>> ex-girlfriend" as these tokens: [the, exgirlfriend, ex, gi
hello *, i have a field that is indexing the string "the
ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend]
then they are passed to the edgengram filter, this allows me to match
different user spellings and allows for partial highlighting, however
a token like 'ex' would get genera
hello *, ive been using the highlighter and been pretty happy with
its results, however theres an edge case im not sure how to fix
for query: amazing grace
the record matched and highlighted is
amazing rendition of amazing grace
is there any way to only highlight amazing grace without using phr
do `ant clean dist` within the solr source and use the resulting war
file, though in the future you might think about extending the built in
parser and creating a parser plugin rather than modifying the actual sources
see http://wiki.apache.org/solr/SolrPlugins#QParserPlugin for more info
--jo
no problem with the query.
>
> But from what I believe it should wrap around the text in the result.
>
> So if I search ie Andrew within the return content Ie would have the
> contents with the word Andrew
>
> and hl.fl=attr_content
>
> Thank you for you help
>
>
just to make sure were on the same page, youre saying that the
highlight section of the response is empty right? the results section
is never highlighted but a separate section contains the highlighted
fields specified in hl.fl=
On Wed, Mar 10, 2010 at 5:23 AM, Ahmet Arslan wrote:
>
>
>> Yes Cont
did you enable the highlighting component in solrconfig.xml? try setting
debugQuery=true to see if the highlighting component is even being
called...
On Tue, Mar 9, 2010 at 12:23 PM, Lee Smith wrote:
> Hey All
>
> I have indexed a whole bunch of documents and now I want to search against
> them.
>
ive found the csv update to be exceptionally fast, though others enjoy
the flexibility of the data import handler
On Fri, Mar 5, 2010 at 10:21 AM, Mark N wrote:
> what should be the fastest way to index a documents , I am indexing huge
> collection of data after extracting certain meta - data inf
or you can try the commongrams filter that combines tokens next to a stopword
On Tue, Mar 2, 2010 at 6:56 AM, Walter Underwood wrote:
> Don't remove stopwords if you want to search on them. --wunder
>
> On Mar 2, 2010, at 5:43 AM, Erick Erickson wrote:
>
>> This is a classic problem with Stopword
what are you using for the mm parameter? if you set it to 1, only one
word has to match
On 03/01/2010 05:07 PM, Steve Reichgut wrote:
***Sorry if this was sent twice. I had connection problems here and it
didn't look like the first time it went out
I have been testing out results for some
you can set a default shard parameter on the request handler doing
distributed search, you can set up two different request handlers one
with shards default and one without
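a solrconfig.xml sketch of the two-handler setup (handler names and hosts are made up):

```xml
<!-- handler clients hit; fans out to the shards by default -->
<requestHandler name="search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">host0:8080/solr,host1:8080/solr</str>
  </lst>
</requestHandler>

<!-- plain handler the individual shards are queried through -->
<requestHandler name="local" class="solr.SearchHandler"/>
```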
On Thu, Feb 25, 2010 at 1:35 PM, Jeffrey Zhao
wrote:
> Now I got it, just forgot put qt=search in query.
>
> By the way, in
extend the similarity class, compile it against the jars in lib, put it in
a path solr can find and set your schema to use it
http://wiki.apache.org/solr/SolrPlugins#Similarity
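the schema hook is just one line near the end of schema.xml (the class name here is made up; it has to be on a classpath solr loads from):

```xml
<similarity class="com.example.MyCustomSimilarity"/>
```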
On 02/25/2010 10:09 PM, Pooja Verlani wrote:
Hi,
I want to modify Similarity class for my app like the following-
Right no
i had to create an autosuggest implementation not too long ago,
originally i was using faceting, where i would match wildcards on a
tokenized field and facet on an unaltered field, this had the
advantage that i could do everything from one index, though it was
also limited by the fact suggestions ca
use the common grams filter, itll create tokens for stop words and
their adjacent terms
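a field type sketch (attributes per CommonGramsFilterFactory, untested here):

```xml
<fieldType name="text_cgrams" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "the british open" -> the, the_british, british, british_open, open -->
    <filter class="solr.CommonGramsFilterFactory"
            words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```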
On Thu, Feb 18, 2010 at 7:16 AM, Nagelberg, Kallin
wrote:
> I've noticed some peculiar behavior with the dismax searchhandler.
>
> In my case I'm making the search "The British Open", and am getting 0
> resul
no, youre just changing how youre querying the index, not the actual
index, you will need to restart the servlet container or reload the core
for the config changes to take effect tho
On 02/17/2010 10:04 AM, Frederico Azeiteiro wrote:
Hi,
If i change the "defaultSearchField" in the core schem
no but you can set a default for the qf parameter with the same value
On 02/15/2010 01:50 AM, Steve Radhouani wrote:
Hi there,
Can the option be used by the DisMaxRequestHandler?
Thanks,
-Steve
lucene-2266 filed and patch posted.
On 02/13/2010 09:14 PM, Robert Muir wrote:
Joe, can you open a Lucene JIRA issue for this?
I just glanced at the code and it looks like a bug to me.
On Sun, Feb 14, 2010 at 12:07 AM, Joe Calderonwrote:
i ran into a problem while using the edgengramtoken
i ran into a problem while using the edgengramtokenfilter, it seems to
report incorrect offsets when generating tokens, more specifically all
the tokens have offset 0 and term length as start and end, this leads
to goofy highlighting behavior when creating edge grams for tokens
beyond the first one
when using solr.xml, you can specify a sharedlib directory to share
among cores, is it possible to reload the classes in this dir without
having to restart the servlet container? it would be useful to be able
to make changes to those classes on the fly or be able to drop in new
plugins
if you use the core model via solr.xml you can reload a core without
having to to restart the servlet container,
http://wiki.apache.org/solr/CoreAdmin
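the reload itself is one http call per the CoreAdmin wiki (host, port and core name are illustrative):

```
http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0
```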
On 02/11/2010 02:40 PM, Emad Mushtaq wrote:
Hi,
I would like to know if there is a way of reindexing data without restarting
the server. Lets sa
sorry, what i meant to say is apply text analysis to the part of the
query that is wildcarded, for example if a term with latin1 diacritics
is wildcarded i'd still like to run it through ISOLatin1Filter
On Wed, Feb 10, 2010 at 4:59 AM, Fuad Efendi wrote:
>> hello *, quick question, what would i h
you can do that very easily yourself in a post processing step after
you receive the solr response
On Wed, Feb 10, 2010 at 8:12 AM, gdeconto
wrote:
>
> I have been able to apply and use the solr-236 patch (field collapsing)
> successfully.
>
> Very, very cool and powerful.
>
> My one comment/conc
hello *, quick question, what would i have to change in the query
parser to allow wildcarded terms to go through text analysis?
On iPhone so don't remember exact param I named it, but check wiki -
> something like hl.highlightMultiTerm - set it to false.
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Feb 6, 2010, at 12:00 AM, Joe Calderon wrote:
>
>> hello *, currentl
hello *, currently with hl.usePhraseHighlighter=true, a query for (joe
jack*) will highlight joe jackson, however after reading the
archives, what im looking for is the old 1.1 behaviour so that only
joe jack is highlighted, is this possible in solr 1.5 ?
thx much
--joe
i want to recompile lucene with
http://issues.apache.org/jira/browse/LUCENE-2230, but im not sure
which source tree to use, i tried using the implied trunk revision
from the admin/system page but solr fails to build with the generated
jars, even if i exclude the patches from 2230...
im wondering i
is it possible to configure the distance formula used by fuzzy
matching? i see there are others on the function query wiki page under
strdist but im wondering if they are applicable to fuzzy matching
thx much
--joe
a shard has failed
On Wed, Feb 3, 2010 at 10:55 AM, Yonik Seeley
wrote:
> On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon wrote:
>> hello *, in distributed search when a shard goes down, an error is
>> returned and the search fails, is there a way to avoid the error and
>> re
rch will have to collate associated
> information into a presentable screen anyhow - so I'm not too worried about
> info being returned by Solr as such)
>
> Would that be a reasonable way of using Solr
>
>
>
>
> -Original Message-
> From: Joe Calderon [mailto:
by default solr will only search the default field, you have to
either query all fields field1:(ore) OR field2:(ore) OR field3:(ore)
or use a different query parser like dismax
On Tue, Feb 2, 2010 at 3:31 PM, Stefan Maric wrote:
> I have got a basic configuration of Solr up and running and have
hello *, in distributed search when a shard goes down, an error is
returned and the search fails, is there a way to avoid the error and
return the results from the shards that are still up?
thx much
--joe
facets are based off the indexed version of your string not the stored
version, you probably have an analyzer thats removing punctuation,
most people index the same field multiple ways for different purposes,
matching, sorting, faceting etc...
index a copy of your field as string type and facet o
main reason im creating the new request
handler, or do i put them all as defaults under my new request handler
and let the query parser use whichever ones it supports?
On Thu, Jan 21, 2010 at 11:45 AM, Yonik Seeley
wrote:
> On Thu, Jan 21, 2010 at 2:39 PM, Joe Calderon wrote:
>> hello *
hello *, what is the best way to create a requesthandler for
distributed search with a default shards parameter but that can use
different query parsers
thus far i have
*,score
json
host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080
this has come up before, my suggestion would be to use the 12/24
patch with trunk revision 892336
http://www.lucidimagination.com/search/document/797549d29e1810d9/solr_1_4_field_collapsing_what_are_the_steps_for_applying_the_solr_236_patch
2010/1/19 Licinio Fernández Maurelo :
> Hi folks,
>
> i'
I think you need to use the new trieDateField
On 01/12/2010 07:06 PM, Daniel Higginbotham wrote:
Hello,
I'm trying to boost results based on date using the first example
here:http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
However, I'm getting an er
it seems to be in flux right now as the solr developers slowly make
improvements and ingest the various pieces into the solr trunk, i think
your best bet might be to use the 12/24 patch and fix any errors where
it doesnt apply cleanly
im using solr trunk r892336 with the 12/24 patch
--joe
On
matches
sorry if i was unclear
--joe
On Mon, Jan 11, 2010 at 10:13 AM, Erik Hatcher wrote:
>
> On Jan 11, 2010, at 12:56 PM, Joe Calderon wrote:
>>
>> 1. given a set of fields how to return matches that match across them
>> but not just one specific one, ex im using a
hello *, im looking for help on writing queries to implement a few
business rules.
1. given a set of fields how to return matches that match across them
but not just one specific one, ex im using a dismax parser currently
but i want to exclude any results that only match against a field
called 'd
hello *, what do i need to do to make a query parser that works just
like the standard query parser but also runs analyzers/tokenizers on a
wildcarded term, specifically im looking to only wildcard the last
token
ive tried the edismax qparser and the prefix qparser and neither is
exactly what i
"if this is the expected behaviour is there a way to override it?"[1]
[1] me
On Thu, Dec 31, 2009 at 10:13 AM, AHMET ARSLAN wrote:
>> Hello *, im trying to make an index
>> to support spelling errors/fuzzy
>> matching, ive indexed my document titles with
>> NGramFilterFactory
>> minGramSize=2 ma
Hello *, im trying to make an index to support spelling errors/fuzzy
matching, ive indexed my document titles with NGramFilterFactory
minGramSize=2 maxGramSize=3, using the analysis page i can see the
common grams match between the indexed value and the query value,
however when i try to do a query
how can i make the score be solely the output of a function query?
the function query wiki page details something like
q=boxname:findbox+_val_:"product(product(x,y),z)"&fl=*,score
but that doesnt seem to work
--joe
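one thing to try (should work on recent 1.4/trunk builds, untested here) is the function query parser via local params, which makes the function the entire query and hence the score:

```
q={!func}product(product(x,y),z)&fl=*,score
```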
hello *, i want to boost documents that match the query better,
currently i also index my field as a string and boost if i match the
string field
but im wondering if its possible to boost with bf parameter with a
formula using the function strdist(), i know one of the columns would
be the field nam
fwiw, when implementing distributed search i ran into a similar
problem, but then i noticed even google doesnt let you go past page
1000, easier to just set a limit on start
On Thu, Dec 24, 2009 at 8:36 AM, Walter Underwood wrote:
> When do users do a query like that? --wunder
>
> On Dec 24, 200
im trying to do a wild card search
"q":"item_title:(gets*)" returns no results
"q":"item_title:(gets)" returns results
"q":"item_title:(get*)" returns results
seems like * at the end of a token is requiring a character, instead
of being 0 or more its acting like 1 or more
the text im tr
sorry got cut off,
patch, then ant clean dist, will give you the modified solr war file,
if it doesnt apply cleanly (which i dont think is currently the case),
you can go back to the latest revision referenced in the patch,
On Tue, Nov 3, 2009 at 8:17 PM, Joe Calderon wrote:
> patch -p0 <
patch -p0 < /path/to/field-collapse-5.patch
On Tue, Nov 3, 2009 at 7:48 PM, michael8 wrote:
>
> Hmmm, perhaps I jumped the gun. I just looked over the field collapse patch
> for SOLR-236 and each file listed in the patch has its own revision #.
>
> E.g. from field-collapse-5.patch:
> --- src/jav
is it possible to tokenize a field on whitespace after some filters
have been applied:
ex: "A + W Root Beer"
the field uses a keyword tokenizer to keep the string together, then
it will get converted to "aw root beer" by a custom filter ive made, i
now want to split that up into 3 tokens (aw, roo
curious...is it possible to have faceted results ordered by score?
im having a problem where im faceting on a field while searching for
the same word twice, for example:
im searching for "the the" on a tokenized field and faceting by the
untokenized version, faceting returns records with "the the
found another exception, i cant find specific steps to reproduce
besides starting with an unfiltered result and then given an int field
with values (1,2,3) filtering by 3 triggers it sometimes, this is in
an index with very frequent updates and deletes
--joe
java.lang.NullPointerException
as a curiosity i'd like to use a profiler to see where within solr
queries spend most of their time, im curious what tools if any others
use for this type of task..
im using jetty as my servlet container so ideally i'd like a profiler
thats compatible with it
--joe
seems to happen when i sort on anything besides strictly score, even
score desc, num desc triggers it, using latest nightly and 10/14 patch
Problem accessing /solr/core1/select. Reason:
4731592
java.lang.ArrayIndexOutOfBoundsException: 4731592
at
org.apache.lucene.search.FieldComparat
hello *, i was just reading over the wiki function query page and
found this little gem for boosting recent docs thats much better than
what i was doing before
recip(ms(NOW,mydatefield),3.16e-11,1,1)
my question is, at the bottom it says
The most effective way to use such a boost is to multiply
cool np, i just didnt want to duplicate code if that already existed.
On Tue, Oct 20, 2009 at 12:49 PM, Yonik Seeley
wrote:
> On Tue, Oct 20, 2009 at 1:53 PM, Joe Calderon wrote:
>> i have a pretty basic question, is there an existing analyzer that
>> limits the number of words
i have a pretty basic question, is there an existing analyzer that
limits the number of words/tokens indexed from a field? lets say i only
wanted to index the top 25 words...
thx much
--joe
hello *, ive read in other threads that lucene 2.9 had a serious bug
in it, hence trunk moved to 2.9.1 dev, im wondering what the bug is as
ive been using the 2.9.0 version for the past weeks with no problems,
is it critical to upgrade?
--joe
hello *, sorry if this seems like a dumb question, im still fairly new
to working with lucene/solr internals.
given a Document object, what is the proper way to fetch an integer
value for a field called "num_in_stock", it is both indexed and stored
thx much
--joe
maybe im just not familiar with the way the version numbers work in
trunk but when i build the latest nightly the jars have names like
*-1.5-dev.jar, is that normal?
On Wed, Oct 14, 2009 at 7:01 AM, Yonik Seeley
wrote:
> Folks, we've been in code freeze since Monday and a test release
> candida
hello *, im using a combination of tokenizers and filters that give me
the desired tokens, however for a particular field i want to
concatenate these tokens back to a single string, is there a filter to
do that, if not what are the steps needed to make my own filter to
concatenate tokens?
for exam
dn't bring it up when Hoss made my
> life easier with his simpler patch.
>
> Yonik Seeley wrote:
>> Might be the new Lucene fieldCache stats stuff that was recently added?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>> On Tue, Oct 6, 2
hello *, ive been noticing that /admin/stats.jsp is really slow in the
recent builds, has anyone else encountered this?
--joe
M when having an
> index of a few million.
>
> Martijn
>
> 2009/10/2 Joe Calderon :
>> i gotten two different out of memory errors while using the field
>> collapsing component, using the latest patch (2009-09-26) and the
>> latest nightly,
>>
>> has anyone el
ive gotten two different out of memory errors while using the field
collapsing component, using the latest patch (2009-09-26) and the
latest nightly,
has anyone else encountered similar problems? my collection is 5
million results but ive gotten the error collapsing as little as a few
thousand
SEVE
thx for the reply, i just want the number of dupes in the query
result, but it seems i dont get the correct totals,
for example a non collapsed dismax query for belgian beer returns X
number results
but when i collapse and sum the number of docs under collapse_counts,
its much less than X
it does
ks,
>>
>> Matt Weber
>>
>> On Sep 30, 2009, at 5:16 PM, Uri Boness wrote:
>>
>>> Hi,
>>>
>>> At the moment I think the most appropriate place to put it is in the
>>> AbstractDocumentCollapser (in the getCollapseInfo method). Though,
how would i go about modifying the dismax parser to treat +/- as regular text?
hello all, i have a question on the field collapsing patch, say i have
an integer field called "num_in_stock" and i collapse by some other
column, is it possible to sum up that integer field and return the
total in the output, if not how would i go about extending the
collapsing component to suppor
hello *, i read on the wiki about using recip(rord(...)...) to boost
recent documents with a date field, does anyone have a good function
for doing something similar with unix timestamps?
if not, is there a lot of overhead related to counting the number of
distinct values for rord() ?
thx much
is the source for the lucid kstemmer available ? from the lucid solr
package i only found the compiled jars
On Mon, Sep 14, 2009 at 11:04 AM, Yonik Seeley
wrote:
> On Mon, Sep 14, 2009 at 1:56 PM, darniz wrote:
>> Pascal Dimassimo wrote:
>>>
>>> Hi,
>>>
>>> I want to try KStem. I'm following the
i have a field called text_stem that has a kstemmer on it, im having
trouble matching wildcard searches on a word that got stemmed
for example i index the word "america's", which according to
analysis.jsp after stemming gets indexed as "america"
when matching i do a query like myfield:(ame*) which
hello *, im not sure what im doing wrong, i have this field defined in
schema.xml, using admin/analysis.jsp its working as expected,
but when i try to update via csvhandler i get
Error 500 org.apache.solr.analysis.PatternTokeni
there are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/, that have
bindings to perl/python, you can preprocess your results and create
clusters for each zoom level
On Tue, Sep 8, 2009 at 8:08 AM, gwk wrote:
> Hi,
>
> I just completed a simple proof-of-concept
i saw some post regarding stemming plurals in the archives from 2008,
i was wondering if this was ever integrated or if custom hackery is
still needed, is there something like a stem-plurals analyzer or is the
kstemmer the closest thing?
thx much
--joe
hello *, what would be the best approach to return the sum of boosts
as the score?
ex:
a dismax handler boosts matches to field1^100 and field2^50, a query
matches both fields hence the score for that row would be 150
is this something i could do with a function query or do i need to
hack up Di
yonik has a point, when i ran into this i also upgraded to the latest
stable jetty, im using jetty 6.1.18
On 08/28/2009 04:07 PM, Rupert Fiasco wrote:
I deployed LucidWorks with my existing solrconfig / schema and
re-indexed my data into it and pushed it out to production, we'll see
how it stac
i had a similar issue with text from past requests showing up, this was
on 1.3 nightly, i switched to using the lucid build of 1.3 and the
problem went away, im using a nightly of 1.4 right now also without
probs, then again your mileage may vary as i also made a bunch of schema
changes that mi
facet.mincount=1, facet.limit=-1
On 08/28/2009 09:15 AM, Candide Kemmler wrote:
Sorry, I have misinterpreted my test results.
In fact, I can see that facets are missing in the original search.
So the question becomes: how is it possible that a search doesn't
report all the facets of a specific r
is there an online resource or a book that contains a thorough list of
tokenizers and filters available and their functionality?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
is very helpful but i would like to go through additional filters to
make sure im not reinventing the wheel
hello *, im currently faceting on a shingled field to obtain popular
phrases and its working well, however i'd like to limit the number of
shingles that get created, the solr.ShingleFilterFactory supports
maxShingleSize, can it be made to support a minimum as well? can
someone point me in the right
i want to try out the improvements in 1.4 but the nightly site is down
http://people.apache.org/builds/lucene/solr/nightly/
is there a mirror for nightlies?
--joe
SELECT id FROM videos
WHERE
title LIKE 'family guy'
AND desc LIKE 'stewie%'
AND is_dup = 0
)
)
)
ORDER BY views
LIMIT 10
can a similar query be written in lucene or do i need to structure my
index differently to be able to do such a query?
thx much
--joe
On S
for first time loads i currently post to
/update/csv?commit=false&separator=%09&escape=\&stream.file=workfile.txt&map=NULL:&keepEmpty=false,
this works well and finishes in about 20 minutes for my work load.
this is mostly cpu bound, i have an 8 core box and it seems one takes
the brunt of the wo
nd didn't have flagged duplicates in the first place? If so, have
> you tried using http://wiki.apache.org/solr/Deduplication ?
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
hello all, i have a collection of a few million documents; i have many
duplicates in this collection. they have been clustered with a simple
algorithm, i have a field called 'duplicate' which is 0 or 1 and
fields called 'description, tags, meta', documents are clustered on
different criteria and