# bitbot-3.11-fork/src/utils/http.py

import re, signal, traceback, urllib.error, urllib.parse
import json as _json
import bs4, requests

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36")
REGEX_HTTP = re.compile("https?://", re.I)

# cap on how many bytes of a response body get_url will read (100 MiB)
RESPONSE_MAX = (1024*1024)*100

# must derive from Exception, otherwise `raise` fails with a TypeError
class HTTPException(Exception):
    pass
class HTTPTimeoutException(HTTPException):
    pass
class HTTPParsingException(HTTPException):
    pass

# signal handlers are called with (signal number, current stack frame)
def throw_timeout(signum, frame):
    raise HTTPTimeoutException()

def get_url(url, method="GET", get_params=None, post_data=None, headers=None,
        json_data=None, code=False, json=False, soup=False, parser="lxml",
        fallback_encoding="utf8"):

    if not urllib.parse.urlparse(url).scheme:
        url = "http://%s" % url

    # copy so neither the caller's dict nor a shared default is mutated
    headers = dict(headers or {})
    if "Accept-Language" not in headers:
        headers["Accept-Language"] = "en-GB"
    if "User-Agent" not in headers:
        headers["User-Agent"] = USER_AGENT

    signal.signal(signal.SIGALRM, throw_timeout)
    signal.alarm(5)
    try:
        response = requests.request(
            method.upper(),
            url,
            headers=headers,
            params=get_params,
            data=post_data,
            json=json_data,
            stream=True
        )
        response_content = response.raw.read(RESPONSE_MAX, decode_content=True)
    except TimeoutError:
        raise HTTPTimeoutException()
    finally:
        # cancel any pending alarm so it cannot fire after we return
        signal.alarm(0)
        signal.signal(signal.SIGALRM, signal.SIG_IGN)

    if soup:
        soup = bs4.BeautifulSoup(response_content, parser)
        if code:
            return response.status_code, soup
        return soup

    data = response_content.decode(response.encoding or fallback_encoding)
    if json and data:
        try:
            data = _json.loads(data)
        except _json.decoder.JSONDecodeError as e:
            raise HTTPParsingException(str(e))

    if code:
        return response.status_code, data
    else:
        return data

def strip_html(s):
    return bs4.BeautifulSoup(s, "lxml").get_text()
# Commit history for this file (from the blame view, oldest first):
#   2018-10-03 12:22:37 +00:00  Move src/Utils.py in to src/utils/, splitting functionality out in to modules of related functionality
#   2018-10-09 21:16:04 +00:00  Return response code from utils.http.get_url when code=True and soup=True
#   2018-10-10 12:41:58 +00:00  Change utils.http to use requests
#   2018-10-10 13:05:15 +00:00  Set a max size of 100mb for utils.http.get_url
#   2018-10-10 13:25:44 +00:00  Use signal.alarm to Deadline utils.http.get_url and throw useful exceptions
#   2018-10-10 14:07:04 +00:00  Fix syntax error for throwing a timeout when signal.alarm fires
#   2018-10-10 22:49:42 +00:00  Add fallback_encoding to utils.http.get_url, in case a page has no implicit encoding
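
# A minimal sketch of the two body-handling steps in get_url: reading at most
# RESPONSE_MAX bytes, then decoding with the declared encoding or a fallback.
# The helper names read_capped and decode_body are hypothetical and not part
# of this module; io.BytesIO stands in for response.raw so that the example
# runs without a network connection.

```python
import io

RESPONSE_MAX = (1024 * 1024) * 100  # same 100 MiB cap as in this module


def read_capped(raw, limit=RESPONSE_MAX):
    # hypothetical helper: read no more than `limit` bytes from a file-like object
    return raw.read(limit)


def decode_body(content, encoding=None, fallback_encoding="utf8"):
    # hypothetical helper: use the declared encoding, else the fallback
    return content.decode(encoding or fallback_encoding)


# a tiny limit is used here only to make the truncation visible
body = read_capped(io.BytesIO(b"caf\xc3\xa9 and more"), limit=5)
text = decode_body(body, encoding=None)
```

# Note that a byte cap can cut a multi-byte character in half, in which case
# decoding raises UnicodeDecodeError; the limit above happens to fall on a
# character boundary.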