This is a message board for coordinating and discussing bot-related issues on Wikipedia (also including other programs interacting with the MediaWiki software). Although this page is frequented mainly by bot owners, any user is welcome to leave a message or join the discussion here.
If you want to report an issue or bug with a specific bot, follow the steps outlined in WP:BOTISSUE first. This is not the place for requests for bot approvals or for requesting that tasks be done by a bot. General questions about the MediaWiki software (such as the use of templates, etc.) should be asked at Wikipedia:Village pump (technical).
New bot-like access group
A new "bot-like" user group has appeared, "Copyright violation bots", possibly related to phab:T199359. Thank you to Dolotta for calling this out. @MMiller (WMF): do you have information about what is going on with this initiative and how it will impact editors here? — xaosflux Talk 18:45, 1 September 2018 (UTC)
- Ping also to ערן as these tasks have your bot account's name all over them. — xaosflux Talk 18:50, 1 September 2018 (UTC)
- Per Special:ListGroupRights, they have the ability to "Tag pages in the Special:NewPagesFeed as likely copyright violations, through the pagetriage-tagcopyvio API (pagetriage-copyvio)". Looks like the group was created in phab:T202041; for use in User:EranBot. Per phab:T201073 "After the attached patch is merged and deployed, there will be a new user group on English Wikipedia called "Copyright violation bots". A bureaucrat on enwiki can then put EranBot in that group." It seems like the purpose is to allow copyright violations detected by EranBot that fill up CopyPatrol to also mark pages as copyright violations on the NPP feed? Galobtter (pingó mió) 18:55, 1 September 2018 (UTC)
- Anything like this will certainly need a new BRFA, I seem to recall there was a notable false positive "copyright violation detection" problem with this bot (for example when copying public domain text). — xaosflux Talk 19:01, 1 September 2018 (UTC)
- Xaosflux: Thank you for opening this discussion. I haven't yet coded the reporting/using the API from the bot side, and haven't yet asked for this right, but I shall do it soon.
- The bot is already running on all changes in enwiki, and suspected edits are reported to a database; users go over them using CopyPatrol, developed by the Community Tech team. The Growth team is working on improving Special:NewPagesFeed and is looking for a way to integrate the copyvio system into it; the tasks above provide some more details.
- Would you like me to open a new BRFA under Wikipedia:Bots/Requests for approval/EranBot/2? Eran (talk) 20:32, 1 September 2018 (UTC)
- If you are still in beta testing, etc - you don't have to do anything here (yet) - not until such time as you want to start testing edits or actions on (as opposed to against) the English Wikipedia. — xaosflux Talk 20:36, 1 September 2018 (UTC)
- Ping to @Roan Kattouw (WMF): for any input. Added phab:T193782 tracking to above. — xaosflux Talk 19:55, 4 September 2018 (UTC)
- Hi everyone, my apologies for not having announced this ahead of time. I thought adding an obscure new user group that isn't being used yet to the already quite long list of user groups would be unlikely to be noticed by anyone, but it took less than 48 hours. That'll teach me to never underestimate Wikipedians :)
- As others already inferred, the idea behind this group is to put EranBot in it, so that it can tell our software which new pages and drafts it thinks are possible copyvios. I wanted to clarify that this "possible copyvio" flag will only appear on Special:NewPagesFeed and nowhere else. It will say "Possible issues: Copyvio" and link to CopyPatrol for a more detailed report on why it thinks it might be copyvio; see also this screenshot. CopyPatrol and EranBot are existing tools that already score and list potential copyvios, and we're trying to make it easier to use them for new page patrolling and draft review.
- The copyvio feature will be available for testing on test.wikipedia.org soon (stay tuned for an announcement from MMiller (WMF)), and is not enabled on English Wikipedia yet (meaning that even if EranBot is put in this group and starts flagging things, those flags won't be displayed). The reason we need a group is that we need a way to trust only EranBot and not other users to pass us this copyvio information, and we figured the groups+rights system was the least bad way to do that. --Roan Kattouw (WMF) (talk) 22:22, 4 September 2018 (UTC)
- @Roan Kattouw (WMF): who is the "we" on this? Will the communities control who we trust to make these inputs? Also please see my note on phab:T199359 or above regarding "unoriginal" vs "copyright infringing" and how this is being determined. — xaosflux Talk 01:59, 5 September 2018 (UTC)
- @Xaosflux: thanks for the question. The work we're doing here is essentially to post the results from the CopyPatrol tool alongside pages in the New Pages Feed. This project has been a collaboration with the NPP and AfC reviewing communities, and our product team has posted as much information as we could about how and why we're building the way we are. CopyPatrol scans all substantial edits (over 500 bytes) by using iThenticate's API (via EranBot), which is a third-party API used by institutions to check original writing for plagiarism. It checks text against its database of websites, academic journals, and books, and says what percent of the text is found in another source. This is not a definitive declaration of copyright violation, because of many potential exceptions. For instance, the text that it finds in multiple locations may in fact be public domain, or it may be a Wikipedia mirror site, or simply a long block quote. Therefore, users of CopyPatrol know that when something is flagged in that interface, it only means there is a potential violation, and that the human editor should investigate further before determining whether there is a violation. Similarly, what we'll do with the New Pages Feed is say "Potential issues: Copyvio", because it is only a potential issue, brought up by proxy through a plagiarism detection service. In conversations with the reviewing communities, it looks like the standard practice there is that any machine-generated flag about copyvio means that a human should investigate to make the decision, which is a practice they use with another popular tool, Earwig's Copyvio Detector. Does this help? -- MMiller (WMF) (talk) 19:01, 5 September 2018 (UTC)
- @MMiller (WMF): Can the "label/tag" be changed to something like "Potential issue: Copied text", "Potential issue: Reused Content", etc? We obviously take copyright very seriously as it is a core project tenet and I don't think we should blindly throw around the phrase "copyright violation". Note, I'm only referring to the label, not the process; to focus on the substance of the content, not the motive of the editor. — xaosflux Talk 20:46, 5 September 2018 (UTC)
- BTW: The bot does more than just "non original" - it detects Wikipedia mirrors (the source indicates it is a mirror of Wikipedia), Creative Commons content (the source indicates it is under a CC license) as well as citations (the added content contains a link to the source). Eran (talk) 21:16, 5 September 2018 (UTC)
- @ערן: that is sort of my point, if someone adds such text their addition may not actually be a "Copyright Violation". I don't disagree that it may warrant extra recent changes patrol attention, and tagging sounds useful. — xaosflux Talk 21:25, 5 September 2018 (UTC)
- Just wanted to chime in to agree with Xaosflux here. I'm far from the most prolific user of Copypatrol, but I spend some time there. There are a number of things that aren't violations, and the false positives are usually either a vandal repeating "hi hi hi hi" or some other text, or something public domain or appropriately licensed. The latter content is frequently one of roughly three specific editors who won't show up in new pages since they're all autopatrolled, but certainly not always. I'd likewise be more comfortable with either of Xaosflux' suggested options. ~ Amory (u • t • c) 01:07, 6 September 2018 (UTC)
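Illustrative note: the checks described in this thread can be sketched roughly as below. This is only a sketch under stated assumptions, not EranBot's or CopyPatrol's actual code. The 500-byte cutoff and the three exception types (Wikipedia mirror, CC-licensed source, cited source) come from the comments above; the `MatchReport` type, the function names, the 0.5 match cutoff, and the idea that an exception suppresses the flag entirely are all hypothetical. The label text is one of the wordings suggested above, not a decided name.

```python
from dataclasses import dataclass
from typing import Optional

MIN_EDIT_BYTES = 500   # "substantial edits (over 500 bytes)", per the thread
MATCH_CUTOFF = 0.5     # hypothetical: the thread gives no numeric cutoff


@dataclass
class MatchReport:
    """Hypothetical summary of one plagiarism-service result for an edit."""
    added_bytes: int                  # size of the text added by the edit
    match_fraction: float             # share of added text found elsewhere (0..1)
    source_is_wikipedia_mirror: bool = False
    source_is_cc_licensed: bool = False
    edit_cites_source: bool = False


def needs_human_review(report: MatchReport) -> bool:
    """Return True when a human should look at the edit.

    This never decides that something *is* a copyright violation;
    it only says whether the match is worth a reviewer's attention.
    """
    if report.added_bytes <= MIN_EDIT_BYTES:
        return False  # too small to be scanned at all
    if report.match_fraction < MATCH_CUTOFF:
        return False  # little overlap with any known source
    # Assumption: the exceptions named in the thread suppress the flag.
    if (report.source_is_wikipedia_mirror
            or report.source_is_cc_licensed
            or report.edit_cites_source):
        return False
    return True


def feed_label(report: MatchReport) -> Optional[str]:
    """Neutral label for the feed, or None when nothing should be shown."""
    if needs_human_review(report):
        return "Potential issue: Copied text"
    return None
```

Public domain text and long block quotes, also mentioned above, are not modelled here; in this sketch they would still be flagged and left for the human reviewer to sort out.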
Xaosflux, I would like to start testing it on enwiki and get EranBot into the copyviobot group - any report that goes to CopyPatrol will also go to PageTriage via the enwiki API. AFAIK (Roan Kattouw (WMF), MMiller (WMF) correct me if I'm wrong) currently it will not be displayed to users, and later (once enabled in the PageTriage extension) there will be a small hint with a link to CopyPatrol for further information (example). As for the text of the hint, I think there are good arguments here for why the hint text should be carefully considered (it shouldn't cast aspersions of a legal violation) - it can be discussed later how to name it (copycheck? copypatrol?). Thanks, Eran (talk) 10:32, 15 September 2018 (UTC)
- @ערן: please file a WP:BRFA to have your new task reviewed. It looks like some testing may have been done on testwiki, if so please include information from those tests in the BRFA. I'm glad you are open to the labeling update (note the name of the "access group" doesn't matter, we can rename that locally) and both of your above suggestions sound fine to me. I certainly expect this will catch actual "copyright violations" in addition to other copied text - but we can leave the blaming up to humans :D — xaosflux Talk 14:47, 15 September 2018 (UTC)
- Wikipedia:Bots/Requests for approval/EranBot/2. Eran (talk) 16:08, 15 September 2018 (UTC)
- Roan Kattouw (WMF), MMiller (WMF) - it looks like local community access wasn't included when the new group was created (e.g. "Allow bureaucrats to add/remove users from this group"); this should be done for enwiki, testwiki, test2wiki - do you need us to create the phab configuration request for this? — xaosflux Talk 14:51, 15 September 2018 (UTC)
- Thanks Xaosflux and ערן. I think Roan Kattouw (WMF) can answer about the technical parts on Monday. With respect to the naming and wording, your point is taken, and I will think about this some more and discuss with the community. We're actually going to be deploying the first part of this overall project to production on Monday, and most of my attention is on that (not involving a bot). So I will revisit the naming over the next couple weeks. -- MMiller (WMF) (talk) 21:39, 15 September 2018 (UTC)
I've created a VPP section to expand 'crat access to include management of the new botgroup, copyvio bot. Please see Wikipedia:Village_pump_(policy)#bureaucrat_access_to_manage_copyviobot_group for details. Thank you, — xaosflux Talk 02:12, 11 October 2018 (UTC)