How to use full-text search URL parser to filter query results by domain name?

2019-04-06 Thread Jess Wren
I am trying to understand how to use the full-text search parser for
URLs and hostnames to filter results from a text field containing URLs
based on domain, and also how to index text columns for fast
lookup/matching based on domain.

I have a PostgreSQL database containing documents and links downloaded
by a web crawler, with the following tables:

    pages
    -----
    id:          Integer (primary key)
    url:         String  (unique)
    title:       String
    text:        String
    html:        String
    last_visit:  DateTime
    word_pos:    TSVECTOR

    links
    -----
    id:        Integer (primary key)
    source:    String
    target:    String
    link_text: String
    UNIQUE(source, target)

    crawls
    ------
    id:    Integer (primary key)
    query: String

    crawl_results
    -------------
    id:       Integer (primary key)
    score:    Integer (constraint 0 <= score <= 1)
    crawl_id: Integer (foreign key, crawls.id)
    page_id:  Integer (foreign key, pages.id)


The `source` and `target` fields in the `links` table contain URLs. I am
running the following query to extract scored links from the top-ranking
search results, for pages that haven't been fetched yet:

    WITH top_results AS (
        SELECT page_id, score FROM crawl_results
        WHERE crawl_id = $1
        ORDER BY score LIMIT 100
    )
    SELECT top_results.score, l.target
    FROM top_results
    JOIN pages p ON top_results.page_id = p.id
    JOIN links l ON p.url = l.source
    WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target = pp.url)


However, *I would like to filter these results so that only one row is
returned per domain (the one with the lowest score)*. So for instance,
if I get (0.3, 'http://www.foo.com/bar') and (0.8,
'http://www.foo.com/zor'), I only want the first row, because both URLs
share the domain `foo.com` and it has the lower score.
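
If the host were already in its own column, I think I could express "one
row per domain, keeping the lowest score" with DISTINCT ON. A minimal
sketch of what I mean (`candidate_links` and its `host` column are
hypothetical; my real data only has the full URL):

    SELECT DISTINCT ON (host) host, score, target
    FROM candidate_links
    ORDER BY host, score;

What I can't see is how to derive that host value from the URL inside
the query itself.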

I was able to find documentation for the builtin full-text search
parsers, which can parse URLs and extract the hostname. For instance, I
can extract the hostname from a URL as follows:

    SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;

        token
    -------------
     www.foo.com
    (1 row)
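
As far as I can tell, the magic number 6 comes from the default parser's
token-type list, where tokid 6 is the "host" token type (there are also
"protocol", "url" and "url_path" types). `ts_token_type()` shows the
full mapping:

    SELECT tokid, alias, description
    FROM ts_token_type('default');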



However, I can't figure out how I would integrate this into the above
query to filter out duplicate domains from the results. And because this
appears in the docs under "testing and debugging text search", I don't
know if this use of `ts_parse()` is even related to how the URL parser
is intended to be used in practice.

How would I use the "host" parser in my query above to return one row
per domain? Also, how would I appropriately index the "links" table for
"host" and "url" token lookup?

Thanks!



Re: How to use full-text search URL parser to filter query results by domain name?

2019-04-07 Thread Jess Wren
On 4/6/19 11:42 PM, haman...@t-online.de wrote:
> Hi,
>
> I have no real idea about solving the complete problem, and would probably try
> something with a temp table first.
> For extracting the hostname from a url you could use
>
> select regexp_replace(url, 'https?://([^/]+).*', '\1')
>
> instead of the fulltext parser
>
> Best regards
> Wolfgang

Thanks Wolfgang, I understand that I could implement a function using
regex for this, or just create an extra column/table to store the
hostname data. But there are other parts of the application where I'll
need to extract the URL path, others where I'll want to extract the
scheme, etc. Since Postgres has builtin capabilities for parsing URLs to
do all of this, I'd rather just use the builtin functions instead of
writing them myself using regex or having to generate a temp table each
time I do a lookup.

So although I'm aware that there are a variety of ways to extract the
hostname (right now I'm just doing it in Python), I'm really most
interested in understanding how to use the builtin Postgres URL parsers
to extract host, url path, etc., and how to appropriately create indexes
based on them. The documentation for the URL parser is very sparse, and
I can't find much info online either.
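
For example, if I'm reading the token-type list correctly, a single
ts_parse() call should already split a URL into separate protocol, host
and url_path tokens (tokid 14, 6 and 18 respectively on my install),
something like:

    SELECT tokid, token
    FROM ts_parse('default', 'http://www.foo.com/bar?q=1')
    WHERE tokid IN (14, 6, 18);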





Re: How to use full-text search URL parser to filter query results by domain name?

2019-04-10 Thread Jess Wren
On 4/8/19 4:50 AM, Arthur Zakirov wrote:
> I think it is normal to use ts_parse(). And I suppose you might use
> window functions.
>
> For example, you have table links:
>
> =# create table links (score int, link text);
> =# insert into links values
>   (1, 'http://www.foo.com/bar'),
>   (2, 'http://www.foo.com/foo'),
>   (2, 'http://www.bar.com/foo'),
>   (1, 'http://www.bar.com/bar');
>
> You can use the following query:
>
> =# with l as (
>   select score, token, link,
>     rank() over (partition by token order by score) as rank
>   from links,
>     lateral ts_parse('default', link)
>   where tokid = 6)
> select score, token, link from l where rank = 1;
>  score |    token    |          link
> -------+-------------+------------------------
>      1 | www.bar.com | http://www.bar.com/bar
>      1 | www.foo.com | http://www.foo.com/bar
>

Thank you very much Arthur. Your suggestion led me to a query that is at
least returning the correct result set. I could not figure out how to
get your rank() approach to work with my query, but building on your
answer (and others from IRC etc.), I ended up with the following solution:

First I created the following views:

    CREATE VIEW scored_pages AS (
        SELECT crawl_results.crawl_id, crawl_results.score,
               crawl_results.page_id, pages.url
        FROM crawl_results
        JOIN pages ON crawl_results.page_id = pages.id
    );

    CREATE VIEW scored_links AS (
        SELECT scored_pages.score, links.source, links.target,
               links.link_text
        FROM links
        JOIN scored_pages ON scored_pages.url = links.source
    );

Then, using these views, I did the following query to extract the links
from the lowest scored pages in the results:

    SELECT score, host, target
    FROM (
        SELECT DISTINCT ON (token)
            token AS host, score, target
        FROM scored_links,
            LATERAL ts_parse('default', target)
        WHERE tokid = 6
        ORDER BY token, score
    ) AS x
    WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target = pp.url)
    ORDER BY score;

Does this seem like a reasonable approach? When running EXPLAIN on this
query, I get the following:

                                                  QUERY PLAN
    ------------------------------------------------------------------------------------------------------------
     Sort  (cost=1252927.46..1252927.47 rows=1 width=100)
       Sort Key: crawl_results.score
       ->  Hash Anti Join  (cost=1248297.18..1252927.45 rows=1 width=100)
             Hash Cond: ((links.target)::text = (pp.url)::text)
             ->  Unique  (cost=1247961.08..1252591.28 rows=5 width=100)
                   ->  Sort  (cost=1247961.08..1250276.18 rows=926040 width=100)
                         Sort Key: ts_parse.token, crawl_results.score
                         ->  Gather  (cost=1449.79..1054897.20 rows=926040 width=100)
                               Workers Planned: 2
                               ->  Hash Join  (cost=449.79..961293.20 rows=385850 width=100)
                                     Hash Cond: ((links.source)::text = (pages.url)::text)
                                     ->  Nested Loop  (cost=0.00..955091.41 rows=378702 width=144)
                                           ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
                                           ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
                                                 Filter: (tokid = 6)
                                     ->  Hash  (cost=404.67..404.67 rows=3609 width=63)
                                           ->  Hash Join  (cost=336.10..404.67 rows=3609 width=63)
                                                 Hash Cond: (crawl_results.page_id = pages.id)
                                                 ->  Seq Scan on crawl_results  (cost=0.00..59.09 rows=3609 width=12)
                                                 ->  Hash  (cost=291.60..291.60 rows=3560 width=59)
                                                       ->  Seq Scan on pages  (cost=0.00..291.60 rows=3560 width=59)
             ->  Hash  (cost=291.60..291.60 rows=3560 width=55)
                   ->  Seq Scan on pages pp  (cost=0.00..291.60 rows=3560 width=55)
    (23 rows)


I am wondering if there is a more efficient way to do things. Some
people on IRC mentioned that it might be better to declare a scalar
function to return the host from ts_parse() instead of using the LATERAL
query, but I couldn't figure out how to do that, or whether it was even
preferable to the above from a performance standpoint. Any ideas on how
I could improve the above?
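
For what it's worth, the closest I've come to the scalar-function idea
is a sketch like the following (untested; in particular I'm not sure
whether it is actually safe to mark it IMMUTABLE, which an expression
index would require):

    CREATE FUNCTION url_host(url text) RETURNS text AS $$
        -- pick out the first "host" token (tokid 6) from the URL
        SELECT token FROM ts_parse('default', url) WHERE tokid = 6 LIMIT 1;
    $$ LANGUAGE sql IMMUTABLE STRICT;

    -- hypothetical expression index for host lookups on links.target
    CREATE INDEX links_target_host_idx ON links (url_host(target));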