On 07/12/23 at 20:24 +0100, Andreas Tille wrote:
> Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
> > On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
> > > Hi,
> > >
> > > by chance I realised that the uploaders table contains some names where
> > > names
> > > are not stripped:
> > >
> > > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from
> > > uploaders u where name like '% ' or name like ' %' ;
> > > name_with_spaces | uploader
> > > --------------------------+-------------------------------------------
> > > " Mehdi Dogguy" | Mehdi Dogguy <[email protected]>
> > > " David Paleino" | David Paleino <[email protected]>
> > > " Stéphane Glondu" | Stéphane Glondu <[email protected]>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
> > > "Andreas Tille " | Andreas Tille <[email protected]>
> > > " LI Daobing" | LI Daobing <[email protected]>
> > > " David Paleino" | David Paleino <[email protected]>
> > > " Stefano Zacchiroli" | Stefano Zacchiroli <[email protected]>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <[email protected]>
> > > "Colin Tuckley " | Colin Tuckley <[email protected]>
> > > "Colin Tuckley " | Colin Tuckley <[email protected]>
> > > "Colin Tuckley " | Colin Tuckley <[email protected]>
> > > (20 rows)
> > > ...
> > > UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' '
> > > || email WHERE name like ' %' or name like '% ' ;
> > >
> >
> > Uploaders is refreshed every few hours from archive data, so a one-time
> > UPDATE would not help. UDD usually tries to preserve inaccuracies, so
> > those might be interesting for QA work.
>
> OK.
>
> > In your case, why don't you use the email address to identify uploaders?
>
> Since this also does not work:
>
> udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%'
> GROUP BY uploader;
> count | uploader
> -------+------------------------------------
> 1 | Andreas Tille <[email protected]>
> 1 | Andreas Tille <[email protected]>
> 8785 | Andreas Tille <[email protected]>
> (3 Zeilen)
>
> > (possibly combining it with the carnivore data to identify different emails
> > belonging to the same person ?)
>
> I could fiddle around with carnivore but that's overkill for thst
> purpose and I insist that not stripping blanks from names does not make
> any sense, IMHO. (1 Zeile)
>
>
> BTW: I found
>
> udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN
> maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh
> WHERE name ilike '%tille%' group by name;
> count | name
> -------+---------------
> 16524 | Andreas Tille
> (1 Zeile)
>
> So why do I have 8707 uploads per uploaders but 16524 per upload_history?
>
> Is my assumption wrong that both values should match (modulo some wrongly
> spelled names)
If you look at the uploaders table, there are three columns:
- 'uploader', than contains the raw data
- 'name' and 'email' that contain the parsed (and trimmed) data
udd=> select uploader, name, email, count(*) from uploaders where uploader
ilike '%tille%' group by 1,2,3;
uploader | name | email |
count
------------------------------------+-----------------+------------------+-------
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 8785
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1
Andreas Tille <[email protected]> | Andreas Tille | [email protected] | 1
So, just use name and/or email?
Lucas