facet filter n

carmen Tue, 08 Feb 2011 04:10:26 -0800

_babel
Java exceptions, server down, and currently this
{
        "items" : [
                {
                        "TDATE" :         "0:00:00",
                        "MOD" :           "D",
                        "STATION" :       [
                                "EWTN (WEWN)",
                                "WEWN",
                                "WEWN EWTN Catholic R.",
                                "Radio Free Asia",
                                "CNR1 Jammer",
                                "IBB",
                                "R.FARDA",
                                "Radio Farda"
                        ],


(wrong, as there's only one STATION per row. suppose it's open source and i 
could install a Babel locally and try to figure it out)

_google-refine
latest snapshot, unreported parse errors, visible as entire lines or even the 
rest of the document appearing in single facet fieldnames.. 

wrote a TSV parser that works on the 
xls2txt(http://wizard.ae.krakow.pl/%7Ejb/xls2txt/) output of a XLS file from 
hfskeds(http://www.hfskeds.com/skeds/)

 def csv
    open(node).readlines.map{|l|l.chomp.split(/,/)}.do{|t|
      t[0].do{|x|
        t[1..-1].each_with_index{|r,ow|r.each_with_index{|v,i|
            yield '#r'+ow.to_s,x[i],v
          }}}}
  end
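
a rough standalone sketch of the same idea, for anyone without the .do helper: the first row supplies predicate names and every later cell becomes one triple. each_triple, the tab separator default and the sample rows are my assumptions for illustration, not the code in the repo.

```ruby
# Sketch only: yields one (subject, predicate, object) triple per cell.
# `each_triple` and the sample data are illustrative, not the project's API.
def each_triple(lines, sep = "\t")
  return enum_for(:each_triple, lines, sep) unless block_given?
  rows = lines.map { |l| l.chomp.split(sep, -1) }  # -1 keeps empty trailing cells
  header = rows.first or return                     # first row names the predicates
  rows.drop(1).each_with_index do |row, n|
    row.each_with_index do |value, i|
      yield "#r#{n}", header[i], value              # subject is a per-row fragment id
    end
  end
end

sample = ["STATION\tkc/s\n", "WEWN\t11520\n", "Radio Farda\t12005\n"]
triples = each_triple(sample).to_a
```
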

this is turned into an in-memory RDF/JSON graph, 

  # fromStream :: Graph -> tripleSource -> Graph
  def fromStream m,*i
    send(*i) do |s,p,o|
      m[s] ||= {'uri'=>s}
      m[s][p] ||= []
      m[s][p].push o
    end; m
  end
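
to make the resulting shape concrete, here is a minimal sketch that folds a plain array of triples into the same nested hash (subject -> predicate -> list of objects). graph_from_triples and the sample values are made up for the example.

```ruby
# Sketch: fold [subject, predicate, object] triples into the nested-hash
# graph shape. `graph_from_triples` is an illustrative name.
def graph_from_triples(triples, graph = {})
  triples.each do |s, p, o|
    node = (graph[s] ||= { 'uri' => s })  # every resource carries its own uri
    (node[p] ||= []) << o                 # objects accumulate in a list
  end
  graph
end

g = graph_from_triples([
  ['#r0', 'STATION', 'WEWN'],
  ['#r0', 'LANGUAGE', 'English'],   # sample values, not from the dataset
  ['#r0', 'LANGUAGE', 'Spanish']
])
```
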

and finally to Exhibit JSON via 

 fn Render+'application/json+exhibit',->d,e{
  fields=e.q['f'].do{|f|f.split /,/}
  {items: d.values.map{|r|
      r.keys.-(['uri']).map{|k|
        f=k.frag.do{|f|(f.gsub /\W/,'').downcase} # alphanumeric id restriction 
        if !fields || (fields.member? f)
          r[f]=r[k][0].to_s # rename fieldnames, unwrap value
          r.delete k unless f==k # cleanup unless id same as before
        else
          r.delete k
        end}
      r[:label]=r.delete 'uri' # requires label only
      r
    }}.to_json}
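
a non-mutating sketch of that renaming step, in case the lambda is hard to follow. assumption: the "fragment" of a predicate is whatever follows the last '#' or '/', which stands in for the frag helper; exhibit_items and the sample graph are hypothetical.

```ruby
require 'json'

# Sketch of the field renaming. Assumption: a predicate's fragment is the
# part after the last '#' or '/', standing in for the `frag` helper.
def exhibit_items(graph, fields = nil)
  graph.values.map do |res|
    item = { 'label' => res['uri'] }            # Exhibit needs a label per item
    res.each do |pred, values|
      next if pred == 'uri'
      id = pred.split(%r{[#/]}).last.to_s.gsub(/\W/, '').downcase
      next if id.empty?
      next if fields && !fields.include?(id)    # optional field whitelist
      item[id] = values.first.to_s              # unwrap the first value only
    end
    item
  end
end

graph = {
  '#r0' => { 'uri' => '#r0',
             'http://purl.org/rss/1.0/category' => ['radio'],
             'http://rdfs.org/sioc/ns#has_creator' => ['carmen'] }
}
json = JSON.generate('items' => exhibit_items(graph))
```
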


the reason we massage the fieldnames is elucidated in this message

http://www.mail-archive.com/[email protected]/msg01052.html

all of this is integrated into http://gitorious.org/element , drop a .tsv file 
in a directory ,add ?view=exhibit to querystring , get an exhibit


brought me to the next problem, browser freezing up for 90 seconds as Exhibit 
did something - DOM generation and facet statistics i guess

so i forget exactly what happened next, but i was already using dynamic 
stylesheets in a mail app (each replied-to line wrapped in class=quote, and 
span.quote {display:none} added to the document to hide them). it was pretty 
obvious this would be faster than 
Array.prototype.forEach.call(document.getElementsByClassName('quote'), function(el){el.style.display='none'})

decided to take the same approach to faceted filtering in the browser. i have 
no idea if my choices are the fastest, but they work, and i will probably do 
further experiments (eg, situating common facet values as innermost or 
outermost, ala the SPARQL trick of using the smallest pattern first)

changing qs view=exhibit -> view=e

if a= isn't specified (comma-separated list of predicate URIs) you are 
presented with a list, like:

http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://rdfs.org/sioc/ns#addressed_to
http://rdfs.org/sioc/ns#has_creator
http://purl.org/rss/1.0/category
[Go]

click the ones you want, [Go]

at which point, left side is filled with facet-selector panes

custom views are selected with ev=board

a convention of view/board/base
                view/board/item

where base is handed a function that it calls to put the items wrapped in 
special divs that the CSS will use to filter
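
one way the wrapping could work, as a sketch under my own assumptions (this is not the code in the repo): each item is nested in one div per facet, classed by facet and value index, so a single display:none rule on any wrapper hides the item and selections across facets combine as AND. FacetIndex, the f0v1-style class names and the sample rows are all hypothetical.

```ruby
# Assumed scheme: one nested <div> per facet, classed by facet/value index.
class FacetIndex
  def initialize(facets)
    @facets = facets
    @ids = Hash.new { |h, facet| h[facet] = {} }   # facet -> value -> index
  end

  # Stable class name such as "f0v2" for a (facet, value) pair.
  def css_class(facet, value)
    values = @ids[facet]
    values[value] ||= values.size
    "f#{@facets.index(facet)}v#{values[value]}"
  end

  # Wrap an item's HTML in one div per facet; the first facet ends up outermost.
  def wrap(item, inner_html)
    @facets.reverse.inject(inner_html) do |html, facet|
      value = Array(item[facet]).first.to_s
      %(<div class="#{css_class(facet, value)}">#{html}</div>)
    end
  end

  # CSS rules hiding every known value of `facet` that is not selected.
  def hide_rules(facet, selected)
    hidden = @ids[facet].keys.reject { |v| selected.include?(v) }
    hidden.map { |v| ".#{css_class(facet, v)} {display:none}" }.join("\n")
  end
end

index = FacetIndex.new(%w[LANGUAGE STATION])
rows = [
  { 'LANGUAGE' => ['English'], 'STATION' => ['WEWN'] },        # sample rows
  { 'LANGUAGE' => ['Persian'], 'STATION' => ['Radio Farda'] }
]
html = rows.map { |r| index.wrap(r, r['STATION'].first) }
css  = index.hide_rules('LANGUAGE', ['Persian'])
```

the browser then only has to swap the text of one style element when a facet selection changes, rather than touch each item.
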

a music player, /item draws a single playlist row:

http://blog.whats-your.name/public/smiths.png

figuring out result set is only half the battle for browser, excessive use of 
floats, relative sizes and so on become noticeable in huge data sets

hfskeds is 30K rows, 22 cols or .66 million triples. roughly the upper bounds 
of what i'd want to use, on a Netbook. takes about 5 seconds to load a doc and 
0.8 second to redraw after filter change

can squeeze out a faster redraw with <pre>, fixed heights/widths and absolute 
positioning

shortwave schedules were main dataset so lets get into some of those

http://blog.whats-your.name/public/25m.html

#!/bin/sh
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=2200&minP=kc/s&maxP=kc/s&max=2500' > 120m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=3100&minP=kc/s&maxP=kc/s&max=3450' > 90m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=3890&minP=kc/s&maxP=kc/s&max=4000' > 75m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=4740&minP=kc/s&maxP=kc/s&max=5125' > 60m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=5800&minP=kc/s&maxP=kc/s&max=6300' > 49m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=7200&minP=kc/s&maxP=kc/s&max=7600' > 40m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=9400&minP=kc/s&maxP=kc/s&max=9999' > 31m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=11500&minP=kc/s&maxP=kc/s&max=12160' > 25m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=13500&minP=kc/s&maxP=kc/s&max=13900' > 22m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=15100&minP=kc/s&maxP=kc/s&max=15900' > 19m.html
curl 'http://m/a.tsv?view=e&ev=sw&a=LANGUAGE,STATION&min=17500&minP=kc/s&maxP=kc/s&max=17900' > 16m.html

created a HTML file for each band and uploaded to webserver..
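
the per-band commands only differ in min, max and output name, so they could also be generated from a table. a small sketch, with the band edges copied from the script; BANDS and band_url are names i made up here.

```ruby
# Band edges in kc/s, copied from the shell script; names are illustrative.
BANDS = {
  '120m' => [2200, 2500],   '90m' => [3100, 3450],   '75m' => [3890, 4000],
  '60m'  => [4740, 5125],   '49m' => [5800, 6300],   '40m' => [7200, 7600],
  '31m'  => [9400, 9999],   '25m' => [11500, 12160], '22m' => [13500, 13900],
  '19m'  => [15100, 15900], '16m' => [17500, 17900]
}

# Build the request URL for one band using the default min/max filter.
def band_url(min, max, base = 'http://m/a.tsv')
  "#{base}?view=e&ev=sw&a=LANGUAGE,STATION" \
    "&min=#{min}&minP=kc/s&maxP=kc/s&max=#{max}"
end

# One curl command per band, in the same order as the table.
commands = BANDS.map do |band, (min, max)|
  "curl '#{band_url(min, max)}' > #{band}.html"
end
```
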

as you can see a default filter exists, maxP, minP (matchP too) which is handy 
for common uses
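
a guess at what such a range filter amounts to, since its code is not shown here. assumed semantics: keep a resource when the value of its minP predicate is >= min and the value of its maxP predicate is <= max; range_filter and the sample graph are hypothetical.

```ruby
# Assumed semantics of the default filter: value(minP) >= min and
# value(maxP) <= max. `range_filter` is an illustrative name.
def range_filter(graph, q)
  min, max = q['min'], q['max']
  graph.select do |_uri, res|
    lo = Array(res[q['minP']]).first
    hi = Array(res[q['maxP']]).first
    next false if lo.nil? || hi.nil?     # resources without the property drop out
    (min.nil? || lo.to_f >= min.to_f) && (max.nil? || hi.to_f <= max.to_f)
  end
end

graph = {
  '#r0' => { 'uri' => '#r0', 'kc/s' => ['11520'] },
  '#r1' => { 'uri' => '#r1', 'kc/s' => ['9400'] },
  '#r2' => { 'uri' => '#r2', 'STATION' => ['IBB'] }
}
q = { 'min' => '11500', 'minP' => 'kc/s', 'maxP' => 'kc/s', 'max' => '12160' }
in_band = range_filter(graph, q)
```
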

custom filters, activated via QS (comma-separated list), can also be written, 
eg this excerpt:

sort of a natural-language one, realizing any time an int < 2400 in email is 
probably referring to a time, and >2400 to frequency (minus a few false 
positives for phone numbers, years)

           m[u]={'uri' => u,
             'big'=>l.scan(/\b[A-Z][A-Z][A-Z]+\b/),
             Content=>l}
           l.scan(/\d{4,}/){|d| d=d.to_i
             if (d > 2400) && (d < 30000)
               m[u]['kc/s']=[d]
             else
               m[u]['BTIM']=[d];m[u]['ETIM']=[d+30]
             end}
           m.delete u unless m[u].has_keys ['BTIM','kc/s']
           )}
  
a filter mutates the request-time JSON model however it sees fit, adding new 
properties and so on..
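
the number heuristic on its own, as a runnable sketch. it is a variant rather than the excerpt's exact behaviour: numbers of 30000 and above are ignored here, where the excerpt would file them as times. tag_numbers is a made-up name, and the +30 end time is copied from the excerpt.

```ruby
# Sketch of the heuristic: 4+ digit runs between 2400 and 30000 are read as
# frequencies (kc/s), runs up to 2400 as a start time. Variant: larger
# numbers are ignored instead of being treated as times.
def tag_numbers(line)
  tags = {}
  line.scan(/\d{4,}/) do |digits|
    n = digits.to_i
    if n > 2400 && n < 30_000
      tags['kc/s'] = [n]
    elsif n <= 2400
      tags['BTIM'] = [n]
      tags['ETIM'] = [n + 30]   # same naive end time as the excerpt
    end
  end
  tags
end

tags = tag_numbers('heard WEWN on 11520 at 0200')
```
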

http://blog.whats-your.name/public/GlenDoes31.html

i did a few more of these, Eibi L and H: 
http://blog.whats-your.name/public/eibiL.html (this is the largest one up now, 
data-wise)

http://blog.whats-your.name/public/bbc.html BBC

onto some other examples

/t is a lifestream (http://www.cs.yale.edu/homes/freeman/dissertation/etf.pdf) 
serving a time-range of resources (with options for start/end direction 
(ascending/descending) and count), here filtered by source

http://i574.photobucket.com/albums/ss187/ix9/hyper/2011-01-16-203039_1366x768_scrot.png

 always add a sioc:addressed_to and sioc:has_creator to triple-izers for this usage


/search  examine shows us top poster is Cory Doctorow (no surprise there)

http://i574.photobucket.com/albums/ss187/ix9/hyper/to.png

i imported all boingboing posts for this one, thats discussed @ 
http://blog.whats-your.name/public/bb.html

a couple possibilities

hash URIs for filters. i will wait for Exhibit 3.0 to come up with their 
convention and use that. or just something like facet=val,val2&facet2=val3,val4
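
if the simple facet=val,val2&facet2=val3,val4 form were used, reading and writing it could look like this sketch. the format is just the one floated above, not an Exhibit convention; parse_facets and facet_fragment are hypothetical names.

```ruby
require 'uri'

# Parse "#facet=val,val2&facet2=val3" into { facet => [values] }.
# Values are split on ',' before decoding so an encoded comma survives.
def parse_facets(fragment)
  fragment.sub(/\A#/, '').split('&').each_with_object({}) do |pair, h|
    facet, vals = pair.split('=', 2)
    next if facet.nil? || facet.empty?
    decoded = vals.to_s.split(',').map { |v| URI.decode_www_form_component(v) }
    (h[URI.decode_www_form_component(facet)] ||= []).concat(decoded)
  end
end

# Inverse: serialise a selection hash back into a fragment string.
def facet_fragment(selection)
  pairs = selection.map do |facet, vals|
    encoded = vals.map { |v| URI.encode_www_form_component(v) }.join(',')
    "#{URI.encode_www_form_component(facet)}=#{encoded}"
  end
  '#' + pairs.join('&')
end

selection = { 'LANGUAGE' => %w[English Persian], 'STATION' => ['Radio Farda'] }
fragment  = facet_fragment(selection)
```
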

visible set - jQuery has a :visible meta-selector, which i have not tried to 
see how fast it is. would be useful if you want to reserialize a document, 
deleting all invisible (filtered) elements.. probably we should make noise 
about adding this right to CSS, as the browser likely has the feature already, 
eg Ctrl-F only searches visible els

"just publish RDFa" would be cool, some JS that introspects a DOM and adds the 
appropriate facet wrappers
-c
