# Ethics & WebScraping
R training school / Labex Dynamite / Florence / September 2018
![CCO](https://mirrors.creativecommons.org/presskit/buttons/88x31/png/cc-zero.png "Public Domain Licence" =100x)
[Sébastien Rey-Coyrehourcq](http://umr-idees.fr/user/s%C3%A9bastien-rey-coyrehourcq/)/[@reyman64](https://twitter.com/reyman64), with the invaluable help of Lionel Maurel (alias [Scinfolex](https://scinfolex.wordpress.com)/[@calimaq](https://twitter.com/Calimaq) online), who agreed to answer my many novice questions.
**Prologue**: The issue is complex and country-dependent, and I'm not a lawyer, so what follows is only my own point of view (Sébastien Rey-Coyrehourcq). This is also a work in progress; we will try to publish the final document somewhere online within the next month.
# The law ... of the jungle ?
Many **grey zones** remain in national and international law around the (somewhat) old practice of web scraping.
There are *indeed* several French laws which protect data and databases from harvesting, e.g. the *Civil Code* (article 1240), the *Consumer Code* (article L121-2 [^L121]), the *Intellectual Property Code* (article L342-1[^L342]) and the *Penal Code* (articles 323-3[^L323] and 311-1[^L311]).
## French and European legislation
The French Penal Code clearly condemns any *fraudulent* intent: if you knowingly bypass a security mechanism to access data, you are guilty. While the definition of a *fraudulent act* remained unclear for years, the *[Bluetouff case law (2015)](https://www.silicon.fr/vol-information-jurisprudence-bluetouff-pour-gloire-117057.html)* now clearly recognizes that the terms "*soustraction frauduleuse de la chose d’autrui*" (fraudulent misappropriation, i.e. theft) are potentially applicable to data. This position is criticized by some [lawyers](http://www.maitre-eolas.fr/post/2014/02/07/NON%2C-on-ne-peut-pas-%C3%AAtre-condamn%C3%A9-pour-utiliser-Gougleu) because, when data is duplicated, the original **doesn't disappear** (as a stolen hard drive would).
Article *L121-2*, combined with article *1240* of the Civil Code, also protects companies from **parasitic behaviour**: you should not scrape data from a website and then create a similar website which relies solely on this data.
Article *L342-1* of the Intellectual Property Code states that the producer of a database can prohibit the extraction of its data beyond a somewhat unclear limit [^L342] (i.e. when a "substantial part" of the data is extracted). Sadly, the term "substantial" is open to interpretation ...
These two articles suggest that the use made of the data, rather than the harvesting itself, is the potential source of legal problems. Some case law helps us understand to what extent:
- RyanAir vs Opodo (Cour d'appel de Paris 2012): this [ruling](https://www.legalis.net/jurisprudences/cour-dappel-de-paris-pole-5-chambre-2-arret-du-23-mars-2012/) is in favour of web scraping because ...
==Finish the description of the case==
- RyanAir vs PR Aviation (Netherlands Supreme Court, then Court of Justice of the European Union, CJUE 2015): this is a similar case, in which PR Aviation scraped prices from different airlines to create an online price comparison service. This time, though, RyanAir went to the CJUE. Although the Dutch court recognized a normal use of the database and rejected the application of European directive 96/9, the CJUE took another [decision](https://eur-lex.europa.eu/legal-content/FR/TXT/?uri=CELEX%3A62014CJ0030), this time unfavourable to web scraping: it confirmed that the use of a database can be restricted through the limitations written in the CGU (terms of use) [^act].
This second case is very bad news for the public domain, and by extension for scraping, as the specialist Lionel Maurel explains on his [blog Scinfolex](https://scinfolex.com/2015/01/23/linformation-ne-peut-plus-etre-libre-a-propos-dun-arret-aberrant-de-la-cjue/):
> Car ce que la CJUE a fait disparaître par cette décision, c’est tout simplement une immense partie du domaine public : celle qui était auparavant constituée par l’information brute et les données. Son raisonnement instaure une possibilité, cette fois-ci absolue – sans aucune exception – de poser des limites par voie contractuelle à la réutilisation de l’information encapsulée dans une base, sans avoir de conditions particulières d’originalité ou d’investissement à remplir.
> *Through this decision, the CJUE simply obliterated a huge part of public domain: that part which was once constituted by raw information and data. Its reasoning introduces the possibility to limit the use of the information contained in a database through a contract, regardless of its originality or the investment it required*
We'll discuss the applicability in France and Europe of complex and/or abusive CGU later. For now, let's consider this objection: we're scientists, not a commercial company! You're totally right. You have probably heard about the recent entry into application of the European RGPD (GDPR), which also includes special derogations for researchers. Let's talk about that.
**Sources:**
- « Le Web Scraping, Une Technique D’extraction Légale ? | Actualités Du Droit | Wolters Kluwer France ». Accessed 14 August 2018. https://www.actualitesdudroit.fr/browse/tech-droit/start-up/9404/le-web-scraping-une-technique-d-extraction-legale.
- « Le Web Scraping ? C’est Quoi ? La légalité du Web Scraping en 5 minutes | François Baulu | Pulse | LinkedIn ». Accessed 14 August 2018. https://www.linkedin.com/pulse/le-web-scraping-cest-quoi-la-l%C3%A9galit%C3%A9-du-en-5-minutes-fran%C3%A7ois-baulu?articleId=6427821828948918272.
### RGPD, some derogations for researchers ?
**Prologue:** I'll try to be brief on this point, because there is already a lot of literature on this complex subject.
In the context of research, the answer is definitely complex, though the question is quite simple: how can I do my everyday job as a scientist *ethically* and *legally* when collecting, exploring and publishing part of these new sources of data?
Since the GDPR (RGPD in French) became applicable in May 2018, users have a new ally in their fight for their rights on the Internet.
For "Mr. GDPR", alias Giovanni Buttarelli (European Union’s data protection supervisor),
> The GDPR aims to redress the startling imbalance of power between big tech and the consumer, giving people more control over their data and making big companies accountable for what they do with it. It replaces the 1995 Data Protection Directive, which required national legislation in each of the 28 E.U. countries in order to be implemented. And it offers people and businesses a single rulebook for the biggest data privacy questions. Tech titans now have a single point of contact instead of 28. ([source Washington post 14/08/2018](https://www.washingtonpost.com/news/theworldpost/wp/2018/08/14/gdpr/?noredirect=on&utm_term=.11ebf48ebf46))
But a case study would be useful to assess the robustness of this recently applied law when confronted with companies tempted to exploit flaws in the text.
The **DisinfoLab affair**, which occurred during (and because of) the **Benalla affair** in August 2018, is probably a great first textbook case for the RGPD and the CNIL.
On 8 August, the Belgian NGO **DisinfoLab** published [a report](https://spark.adobe.com/page/Sa85zpU5Chi1a/) which described allegedly unusual activity on Twitter around the Benalla affair, presumably due to *Russian bots*. Since then, the methodology and conclusions of this report have been criticized by many media outlets and scientists. As far as we are concerned, though, the problem lies elsewhere: to legitimize its methodology, **DisinfoLab** made raw data publicly available on its website (part of it has been deleted since). The three files contained personal data: pseudonyms, public profile texts, tweets and retweets, and for some accounts (identified by the NGO as hyper-active) ... a number which identified a political orientation.
Arthur Messaud, one of the jurists of the NGO La Quadrature du Net, observed that these files were illegal under the recent RGPD:
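> Cette publication n’était nécessaire à la poursuite d’aucun objectif. Or, publier des données perso [nnelles] sans consentement est toujours illicite, si ce n’est nécessaire à aucun objectif. [Source](https://twitter.com/laquadrature/status/1027556732921364480)
> *This publication was not necessary to the pursuit of any objective. Yet publishing personal data without consent is always unlawful if it is not necessary to any objective.*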
So what? You could legitimately object that part of this information is public, and that anyone can read these people's opinions on their public timeline.
Once more, it seems the answer is complex and depends on your country ... Regardless of the forthcoming [public decision of the CNIL](https://www.cnil.fr/fr/etude-realisee-partir-de-messages-postes-sur-twitter-la-cnil-est-saisie-du-dossier) on this affair, many people already consider that expressing a public opinion to your followers doesn't give anyone implicit permission to take you to task for it later!
An article from Numerama citing Valérie Nicolas makes it clear that all this information (pseudonyms, tweets, retweets, likes, geolocation, etc.) is identified by the CNIL as [personal](https://www.cnil.fr/cnil-direct/question/492?visiteur=part), because it can easily be cross-checked to find the physical person behind the tweets.
**Personal data** is defined in RGPD as
> any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person; [Source](https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en)
Some examples :
- a name and surname
- a home address
- an email address such as name.surname@company.com
- an identification card number
- location data (for example the location data function on a mobile phone)
- an Internet Protocol (IP) address
- a cookie ID
- the advertising identifier of your phone
- data held by a hospital or doctor, which could be a symbol that uniquely identifies a person.
Worse, some of the personal information collected by this organisation is also identified as [*données sensibles*](https://www.cnil.fr/cnil-direct/question/495) (sensitive data) by the CNIL, or *catégories particulières de données à caractère personnel* (special categories of personal data) in the RGPD, which adds three new cases. This data covers ethnic origin, sexuality, political opinions, etc.
In France, the **collection, manipulation or publication of personal and/or sensitive data, before or after any processing**, is systematically prohibited without the **explicit consent** of individuals, right?
**By default, yes**, but depending on the sensitivity of the data, different derogations authorize the use of this data **without explicit consent**!
In the RGPD, *explicit consent* to the processing of *données à caractère personnel* (personal data) is only **one of the six lawful bases** listed in [article 6](https://www.cnil.fr/fr/cnil-direct/question/1308).
For the extended category of personal data named *catégories particulières de données à caractère personnel*, the derogations are listed in [Article 9 of the RGPD](https://www.cnil.fr/fr/reglement-europeen-protection-donnees/chapitre2#Article9).
If you look at these two articles, you will see that the RGPD fortunately includes some derogations to protect the work of researchers, but also of other professions, such as journalists[^article85].
Going back to the *DisinfoLab affair*, this NGO claimed that some of these derogations applied to its case (see [document](http://disinfo.eu/2018/08/09/communique-du-eu-disinfolab/)) in order to get around `a)` the absence of explicit consent and `b)` the right of individuals to be informed.
Among the derogations set out in [Article 9 of the RGPD](https://www.cnil.fr/fr/reglement-europeen-protection-donnees/chapitre2#Article9), though, only two seem admissible in this case:
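> [...]
> e) le traitement porte sur des données à caractère personnel qui sont manifestement rendues publiques par la personne concernée;
> *e) processing relates to personal data which are manifestly made public by the data subject;*
> [...]
> j) le traitement est nécessaire à des fins archivistiques dans l'intérêt public, à des fins de recherche scientifique ou historique ou à des fins statistiques, conformément à l'article 89, paragraphe 1, sur la base du droit de l'Union ou du droit d'un État membre qui doit être proportionné à l'objectif poursuivi, respecter l'essence du droit à la protection des données et prévoir des mesures appropriées et spécifiques pour la sauvegarde des droits fondamentaux et des intérêts de la personne concernée.
> *j) processing is necessary for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes in accordance with Article 89(1), based on Union or Member State law which shall be proportionate to the aim pursued, respect the essence of the right to data protection and provide for suitable and specific measures to safeguard the fundamental rights and the interests of the data subject.*
> [...]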
The `e)` derogation raises new questions. There are two ways to collect data: directly from users, or through data resellers. In this case, DisinfoLab used a specialized platform named *Visibrain* to collect data from Twitter, so we're probably in the second case ([source](http://www.reputatiolab.com/2018/07/affaire-benalla-reseaux-sociaux-resurrection-partis-de-lopposition/)). As we discuss later in the TOS section, the way Twitter sells our data to other companies, by collecting our consent through its TOS/CGU, is clearly abusive ... We will see why later, but in short, the absence of a finality (purpose) driving the harvest (and the reselling!) of data is a big part of the problem. The RGPD is very clear on the [finality](https://www.cnil.fr/fr/definir-une-finalite) of data processing:
- You need an explicit and clear objective for data harvesting
- This finality **cannot change** during the project!
- This finality drives which data you collect
- This finality determines how long the data is retained
Moreover, although the raw information was "publicly" accessible, the files shared by DisinfoLab also contained a new categorization, i.e. new data, produced by some largely biased algorithms. Is this a case of *profilage automatisé sans décision automatisée* (automated profiling without automated decision-making, [source](https://www.cnil.fr/fr/profilage-et-decision-entierement-automatisee))?
Even in this case, which is less strict than *profilage automatisé avec décision automatisée*, profiling based on sensitive data **is forbidden by default (see the derogations of article 6), and in any case users need to be informed of their rights** ([CNIL page](https://www.cnil.fr/cnil-direct/question/1381?visiteur=part), [RGPD chapter 3](https://www.cnil.fr/fr/reglement-europeen-protection-donnees/chapitre3#Article22) and [blog avocatspi](http://avocatspi.com/2017/01/17/profilage-ce-que-dit-le-nouveau-reglement-europeen/)).
In its defence, DisinfoLab invokes the "not so well defined" RGPD notion of *intérêt légitime* (legitimate interest, point `f)` of [article 6](https://www.cnil.fr/fr/reglement-europeen-protection-donnees/chapitre2#Article6)) to legitimize its processing. But as [Scinfolex explains in a dedicated post](https://scinfolex.com/2018/08/21/affaire-disinfolab-quelles-retombees-potentielles-sur-la-recherche-publique-et-la-science-ouverte/), this notion can only be invoked *à moins que ne prévalent les intérêts ou les libertés et droits fondamentaux de la personne concernée qui exigent une protection des données à caractère personnel* (unless the interests or fundamental rights and freedoms of the data subject, which require protection of personal data, prevail). Given the publication of files containing Twitter ids and political orientations, DisinfoLab's defence seems very weak ...
After some very interesting discussions with Scinfolex on Twitter, it appears that the best defence for DisinfoLab would probably have been to place its study in the scientific category, because option `j)` of article 9 opens up more derogations than the others. This part is very interesting for us.
I'll try to summarise it in a few bullet points, but please consult the original post for a detailed explanation!
Main derogations for researchers:
- Flexibility on the **finality** of your harvesting project: the finality associated with the data can evolve with the study.
- Flexibility on the data retention duration.
- Possibility of acquiring correctly obtained data (CGU!) from public/private third parties (Twitter, Facebook, INSEE, etc.) without asking each individual for a new explicit consent.
- Exemption from obtaining explicit consent from all individuals, or from informing them of their rights, when contacting them is technically impossible.
- Processing of sensitive data is probably possible, but it depends on the finality of the data harvesting (probably judged case by case by the CNIL).
Duties / advice:
- contact the local CNIL contact at your university, if there is one, or the CNIL directly!
- anonymise or pseudonymise the data!
- minimise the data collected, based on the finality of the study
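
As an illustration of the last two pieces of advice, here is a minimal Python sketch of pseudonymisation and data minimisation. Everything in it is an assumption made for the example: the field names, the `SECRET_KEY`, the `FIELDS_NEEDED` list and the helper functions are hypothetical and do not come from any real dataset or from the CNIL. Keep in mind that pseudonymised data generally remains personal data under the RGPD (the free text of a message can still identify its author), so this reduces risk but does not replace a discussion with the CNIL.

```python
import hashlib
import hmac

# Hypothetical secret key: in practice, generate it randomly and store it
# separately from the dataset, so pseudonyms cannot be recomputed from the data alone.
SECRET_KEY = b"replace-with-a-random-secret"

# Hypothetical list of the only fields the stated finality of the study requires.
FIELDS_NEEDED = ("text", "created_at")


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex encoded)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimise(record: dict, id_field: str = "user_id") -> dict:
    """Keep only the needed fields, plus a pseudonym instead of the raw id."""
    reduced = {name: record[name] for name in FIELDS_NEEDED if name in record}
    reduced["pseudonym"] = pseudonymise(str(record[id_field]))
    return reduced


# Invented example record, for illustration only.
raw = {
    "user_id": "12345",
    "screen_name": "alice",
    "location": "Paris",
    "text": "hello",
    "created_at": "2018-08-08",
}
clean = minimise(raw)
```

The same identifier always maps to the same pseudonym, so records can still be grouped by author, while the screen name and location are never stored.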
For now, we don't know whether the Belgian NGO falls into this **research category**; it is probably the CNIL which will decide on this point... So, **wait & see**.
==Todo transition==
So, as Lionel Maurel told me in a private message, **there is currently no exception for text or data mining**. We can follow the drafting of the future European Copyright directive (see the latest news on [scinfolex](https://scinfolex.com/2018/07/11/debat-de-la-derniere-chance-au-parlement-europeen-pour-reconcilier-le-droit-dauteur-et-les-libertes-tribune-liberation/) and [Quadrature du Net](https://www.laquadrature.net/fr/copyright_mandat)) to see what happens on this point, but even if some sort of derogation is included in this directive, it could take a long time before it is applied...
But if you follow the news on this point, you can see that things are actually going badly wrong with the recent decision of the European Union:
==Work in progress==
**Fortunately, since June 2018 we have a perfect example of the procedure to follow if your objective is to harvest (any) data on a social network. It takes the form of an official deliberation published on the Legifrance website:** https://www.legifrance.gouv.fr/affichCnil.do?id=CNILTEXT000036945250
==Todo SOURCE==
We now turn to some USA legislation, first to appreciate how legislation on web scraping differs between countries, but also to try to understand whether the complex CGU of some GAFA are compatible with French or European law.
---
This articulation between personal data and the research context is very well described in some long and very detailed dedicated posts by Scinfolex:
- ["Données personnelles et recherche scientifique"](https://scinfolex.com/2018/07/18/donnees-personnelles-et-recherche-scientifique-quelle-articulation-dans-le-rgpd/)
- ["Affaire DisinfoLab : quelles retombées potentielles sur la recherche publique et la science ouverte ?"](https://scinfolex.com/2018/08/21/affaire-disinfolab-quelles-retombees-potentielles-sur-la-recherche-publique-et-la-science-ouverte/).
Other, more general articles about the DisinfoLab affair:
- « Est-il vrai que l’ONG DisinfoLab s’est rendue coupable sur Twitter de fichage politique? » Libération.fr, 9 August 2018. http://www.liberation.fr/checknews/2018/08/09/est-il-vrai-que-l-ong-disinfolab-s-est-rendue-coupable-sur-twitter-de-fichage-politique_1671816.
- « Affaire Benalla : cinq questions sur l’étude de l’ONG DisinfoLab, accusée de fichage politique ». Accessed 16 August 2018. https://www.francetvinfo.fr/politique/emmanuel-macron/agression-d-un-manifestant-par-un-collaborateur-de-l-elysee/affaire-benalla-cinq-questions-sur-l-etude-de-l-ong-disinfolab-accusee-de-fichage-politique_2890139.html.
## USA legislation
For example the (old) US *CFAA* (Computer Fraud and Abuse Act) of 1986, or in France, ...
==Work in progress==
# Algorithm & ethics
## Webscraping, following the GAFAM rules ?
> If you are not paying for it, you're not the customer; you're the product being sold.
> [Andrew Lewis in 2010](https://quoteinvestigator.com/2017/07/16/product/)
You could follow some ethics of web scraping: obeying the (unclear) law, obeying `robots.txt`, obeying TOS that prohibit scraping, but please, don't be naive ... We will see in many ways in this chapter that big companies don't respect any of these rules.
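
For readers who do want to honour a site's robots file, the check can be sketched with Python's standard `urllib.robotparser` module. This is only a minimal sketch under stated assumptions: the robots file content, the `research-bot` user agent and the `example.org` URLs are invented for the example. A real scraper would load the live file with `set_url(...)` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots file content, inlined so the example runs offline.
# A real scraper would instead call parser.set_url("https://example.org/robots.txt")
# followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "research-bot"  # hypothetical name identifying our scraper

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str) -> bool:
    """Return True if the parsed rules let USER_AGENT fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


# Number of seconds the site asks crawlers to wait between requests
# (None when the file gives no Crawl-delay directive).
delay = parser.crawl_delay(USER_AGENT)
```

A polite scraper would call `allowed(url)` before each request and sleep for `delay` seconds between requests. Note that this only covers the robots file: it says nothing about the TOS or about the law.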
An example from the scraping industry:
https://www.diffbot.com/welcome/data.jsp
A few examples, picked from the dozens of aggregation services built by Google, illustrate a simple fact: Google, like the other GAFAM, acts as the *king of the jungle* when it comes to data collection/manipulation/aggregation:
- Google Flights crawls and compares the prices of airlines: https://www.google.fr/flights/
- Google Scholar crawls and indexes scientific publications
- Google News crawls and aggregates news at the international level
- Google Shopping
By creating such aggregated services and publishing them at the top of its search results, Google serves its own business first, bypassing the tools of other companies and the original sources of the collected data, and finally makes money by adding ads to the result pages.
But as stated in point 5.3 of the Google TOS:
> 5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
> ‘Scraping’ is Just Automated Access, and Everyone Does It
> Jamie Williams
==Transition==
Consider **TOS**, for example. They are a perfect example of an abusive method used by companies to limit web scraping, but not only that: **they limit your rights by instilling the idea that YOUR personal data is just some common data.**
As we saw recently with the Cambridge Analytica affair, but also with the DisinfoLab affair in France, and as we will probably see in many affairs in the coming years, you have the right **not to be OK** with the use made of your data.
A TOS is a contract between a user and the company which provides the service. First of all, many TOS are impossible to read: see the work of Dima Yarovinsky, who printed some TOS to illustrate this problem.
![Dima Yarovinsky](https://i.imgur.com/RHBKWmJ.jpg)
This text is adapted by the company to the law of each country. But in many cases, international data companies don't respect this and insert abusive clauses into their TOS. Even in the USA, few people read these complex TOS. When a senator asked Facebook CEO Mark Zuckerberg what he thought consumers understood of Facebook's TOS, he answered: *I don’t think the average person likely reads that whole document.* [(Bloomberg 04/20/2018)](https://www.bloomberg.com/news/articles/2018-04-20/uber-paypal-face-reckoning-over-opaque-terms-and-conditions)
Do you really think that a young French child reads the whole TOS in English before creating an account on Facebook? A judge would probably invalidate the TOS in this case ... Even if you read the whole TOS, it is constantly modified, which is humanly impossible to follow. The projects [tosdr](https://tosdr.org/) and [tosback](https://tosback.org/) try to help users on this point.
In France, you have probably heard the story of the French teacher Frédéric Durand-Baïssas, who sued Facebook in 2011 for the abrupt deactivation of his account after he posted on his wall Gustave Courbet's famous painting "Origine du monde". Facebook first argued that, based on its TOS, only an American court could judge this case. This was not the point of view, though, of the French judge who took the case. Seven years later, the *tribunal de grande instance de Paris* concluded that closing an account without any consent was actually abusive. But because Facebook was not really condemned with regard to freedom of expression, Mr Durand-Baïssas is still fighting [(Telerama, March 2018 article)](https://www.telerama.fr/medias/facebook-vs-lorigine-du-monde-la-justice-considere-quil-y-a-eu-faute,-mais-ne-condamne-pas,n5528912.php)
In another case against Twitter's TOS, started in 2014, the NGO UFC-Que Choisir finally obtained the removal of **250 abusive clauses** from Twitter's TOS on August 7th 2018 [(source ufc-que-choisir)](https://www.quechoisir.org/action-ufc-que-choisir-reseaux-sociaux-et-clauses-abusives-l-ufc-que-choisir-obtient-la-suppression-de-centaines-de-clauses-des-conditions-d-utilisation-de-twitter-n57621/)
In Europe, you've probably seen lots of pop-ups in recent months, giving you a "choice": **Accept conditions or leave the service**. Not much of a choice, is it? Scinfolex exposes the case of Facebook in this [long post](https://scinfolex.com/2018/04/22/veuillez-accepter-nos-conditions-la-fabrique-du-consentement-chez-facebook-et-les-moyens-dy-mettre-fin/). The RGPD prohibits this type of forced consent [(see this article by Mr. GDPR: Big tech is still violating your privacy)](https://www.washingtonpost.com/news/theworldpost/wp/2018/08/14/gdpr/?noredirect=on&utm_term=.11ebf48ebf46), which is why the European NGO [NOYB](https://noyb.eu/) and, in France, the NGO *La Quadrature du Net* launched collective actions to alert the CNIL [(see the collective action on GAFAM)](https://gafam.laquadrature.net)
Another abuse concerning data is the implicit consent associated with various algorithms on the web. A very good example of this problem is Google's reCAPTCHA.
![HellCaptach](https://upload.wikimedia.org/wikipedia/fr/9/9d/Captcha_google_checkbox.gif)
If you're lucky, you'll see the green check mark appear on your first try, but if, like me, you use tools to protect your privacy (ad blocker, user agent switcher, EFF privacy extension, HTTPS Everywhere or, worse, a VPN or the Tor network), you'll probably need to click again and again on little squares with very bad photos of street names, cars, buses, road signs, or other very silly things, to train some neural networks to do ... hum ... to do what exactly?
Hence, we give our time for free, as well as our implicit consent, to **ONE company** in order to access some common data on a website, and we do so without knowing the **finality** of this **digital labor**... As Scinfolex and Affordance.info ask, what if all of these thousands of clicks per second helped the US army to train some future drone brain? This aspect is detailed in this very interesting affordance.info [blog post on the subject](http://affordance.typepad.com/mon_weblog/2018/03/im-a-digital-worker-killing-an-arab.html):
> La question de la traçabilité, mais surtout celle de l'intentionnalité des régimes de collecte est essentielle.
> *The question of traceability, but above all that of the intentionality of collection regimes, is essential.*
This problem can be generalized to all automated algorithms that collect data and make decisions based on human interactions. This new field, the **"Ethics of Algorithms"**, tries to expose and understand the causes and consequences (when possible, because complexity is the norm) of this automated background on human life, virtually on the Internet, but also physically in the "real world".
---
[^article85]: Dans le cadre du traitement réalisé à des fins journalistiques ou à des fins d'expression universitaire, artistique ou littéraire, les États membres prévoient des exemptions ou des dérogations au chapitre II (principes), au chapitre III (droits de la personne concernée), au chapitre IV (responsable du traitement et sous-traitant), au chapitre V (transfert de données à caractère personnel vers des pays tiers ou à des organisations internationales), au chapitre VI (autorités de contrôle indépendantes), au chapitre VII (coopération et cohérence) et au chapitre IX (situations particulières de traitement) si celles-ci sont nécessaires pour concilier le droit à la protection des données à caractère personnel et la liberté d'expression et d'information.
[^L121]: Une pratique commerciale est trompeuse si elle est commise dans l'une des circonstances suivantes : [...] 1) Lorsqu'elle crée une confusion avec un autre bien ou service, une marque, un nom commercial ou un autre signe distinctif d'un concurrent ; [...] 2) Lorsqu'elle repose sur des allégations, indications ou présentations fausses ou de nature à induire en erreur et portant sur l'un ou plusieurs des éléments suivants [...] 3) Lorsque la personne pour le compte de laquelle elle est mise en œuvre n'est pas clairement identifiable.
[^L342]: Le producteur de bases de données a le droit d'interdire : a) L'extraction, par transfert permanent ou temporaire de la totalité ou d'une partie qualitativement ou quantitativement substantielle du contenu d'une base de données sur un autre support, par tout moyen et sous toute forme que ce soit ; b) La réutilisation, par la mise à la disposition du public de la totalité ou d'une partie qualitativement ou quantitativement substantielle du contenu de la base, quelle qu'en soit la forme. Ces droits peuvent être transmis ou cédés ou faire l'objet d'une licence. Le prêt public n'est pas un acte d'extraction ou de réutilisation.
[^L323]: Le fait d'introduire frauduleusement des données dans un système de traitement automatisé, d'extraire, de détenir, de reproduire, de transmettre, de supprimer ou de modifier frauduleusement les données qu'il contient est puni de cinq ans d'emprisonnement et de 150 000 € d'amende. Lorsque cette infraction a été commise à l'encontre d'un système de traitement automatisé de données à caractère personnel mis en œuvre par l'Etat, la peine est portée à sept ans d'emprisonnement et à 300 000 € d'amende.
[^L311]: Le vol est la soustraction frauduleuse de la chose d'autrui.
[^act]: *La directive 96/9/CE du Parlement européen et du Conseil, du 11 mars 1996, concernant la protection juridique des bases de données, doit être interprétée en ce sens qu’elle n’est pas applicable à une base de données qui n’est protégée ni par le droit d’auteur ni par le droit sui generis en vertu de cette directive, si bien que les articles 6, paragraphe 1, 8 et 15 de ladite directive ne font pas obstacle à ce que le créateur d’une telle base de données établisse des limitations contractuelles à l’utilisation de celle-ci par des tiers, sans préjudice du droit national applicable.* [source](https://eur-lex.europa.eu/legal-content/FR/TXT/?uri=CELEX%3A62014CJ0030)
### Abbreviations
*[WebScraping]: Web scraping is a term for various methods used to collect information from across the Internet.
*[XHR]: XHR (XMLHttpRequest) is a browser API that lets a page's scripts send HTTP requests; such requests can often be replayed outside of the website
*[DOM]: The Document Object Model (DOM) is a cross-platform and language-independent application programming interface that treats an HTML, XHTML, or XML document as a tree structure where in each node is an object representing a part of the document.
*[XPath]:
*[Captcha]:
*[TOR]:
*[Docker]:
*[AWS]:
*[TOS]: Terms of Service
*[XMLHttpRequests]: (XHRs) to an API
*[API]:
*[Headless]: browser: Selenium /
*[JSON]:
*[JSON-LD]:
*[User-Agent]:
*[Bot Mitigation]:
*[Proxy]:
*[CSS Selector]:
*[CDN]: (content delivery network) which acts as a reverse proxy, saving bandwidth and protecting websites against denial-of-service attacks.
- cloudflare
- distil https://www.distilnetworks.com/block-bot-detection/