Commit graph

90 commits

Author SHA1 Message Date
jesopo
e4a5bd01e9 explicitly use "lxml" for finding page encoding 2019-11-26 14:34:48 +00:00
jesopo
8e9da0d681 _find_encoding takes bytes and soupifies now 2019-11-26 13:58:37 +00:00
jesopo
c898bc4be1 utils.http.request_many() shouldn't decode data for Response 2019-11-26 13:54:17 +00:00
jesopo
2d21dfa229 utils.http.Response.data should always be bytes - add .decode and .soup 2019-11-26 13:42:01 +00:00
jesopo
ed775ddbe3 remove parser from utils.http.Request, add Request.soup() 2019-11-26 11:35:56 +00:00
jesopo
6a6e789ec9 add cookies and .json() to utils.http.Response objects 2019-11-25 18:17:30 +00:00
jesopo
ab8bc65cc9 change utils.http.Request to be a dataclass 2019-11-25 13:42:10 +00:00
jesopo
4d30263315 give bitbot a unique User-Agent
closes #206
2019-11-20 14:42:34 +00:00
Valentin Lorentz
fbf8cd1a16 Fix type errors detected by 'mypy --ignore-missing-imports src'. 2019-10-30 22:26:59 +01:00
jesopo
f64131a10f support utf8 hostnames by punycode (idna) encoding 2019-10-18 10:58:24 +01:00
jesopo
9ab817ca58 parse out content_type in Response ctor 2019-10-05 22:56:56 +01:00
jesopo
b2473a4ac4 parse content-type out in utils.http.request, put it on Response object 2019-10-04 13:07:09 +01:00
jesopo
f306213cb8 'is_localhost()' -> 'host_permitted()' 2019-09-30 15:15:20 +01:00
jesopo
b9c64b7cf1 use ipaddress is_loopback etc to do better forbidden ranges
closes #87
2019-09-30 15:12:01 +01:00
jesopo
2f49fb99e9 assume http fallback_encoding by content-type (utf8 for json) 2019-09-25 15:32:09 +01:00
jesopo
72649a90c2 only BeautifulSoup for finding encoding when it's a html-ish type 2019-09-20 13:38:00 +01:00
jesopo
e34259f967 log call was replaced with Exception but [] on args remained 2019-09-19 15:30:27 +01:00
jesopo
88a69aaa66 give Requests, use them in utils.http.request_many() 2019-09-19 14:54:44 +01:00
jesopo
d8e3a1c7ee utils.http.request_() has no self, let alone self.log 2019-09-19 14:02:48 +01:00
jesopo
b69c9146b2 should be using pair_start/pair_end throughout for 2019-09-19 13:51:27 +01:00
jesopo
cd0d39ee5e also show "bad" data in HTTPParsingException when a message is provided 2019-09-18 14:20:59 +01:00
jesopo
312f8906ae show "bad" data in HTTPParsingException message 2019-09-18 10:52:05 +01:00
jesopo
ee6360be22 don't check already-read data when checking for too-large requests
this check was here because the first read will return empty if it was an
invalid byte sequence for e.g. gzip because we needed to receive more data. the
second read will always return data (not decoded) so regardless of what the
already-read data is, the second read is the only criteria we need.
2019-09-17 17:33:23 +01:00
jesopo
1ac7f2697e log which URL caused an error in request_many 2019-09-17 17:09:19 +01:00
jesopo
98545a9fb4 only decode content-types in DECODE_CONTENT_TYPES 2019-09-17 16:12:03 +01:00
jesopo
8ca0d30fef Response.__init__() needs encoding now 2019-09-17 14:11:12 +01:00
jesopo
b7dd78ef1a restore 5 second (instead of default 10) deadline for http.request 2019-09-17 13:44:14 +01:00
jesopo
94c3ff962b use utils.deadline_process() in utils.http._request() so background threads can
call _request()
2019-09-17 13:41:11 +01:00
jesopo
47735421b8 add json_body arg to Request to json-encode body, only return from body if
not null
2019-09-16 10:57:18 +01:00
jesopo
77f50187c5 allow Requests to specify a useragent 2019-09-12 10:41:50 +01:00
jesopo
9d6a3982ed add a helper utils.http.Client static object 2019-09-11 17:53:49 +01:00
jesopo
51dc26d113 add proxy to Request objects 2019-09-11 17:53:37 +01:00
jesopo
4a97c9eb0d refactor utils.http.requests to support a Request object 2019-09-11 17:44:07 +01:00
jesopo
8f8cf92ae2 automatically decode certain http content types 2019-09-11 15:28:13 +01:00
jesopo
a9b106c6be Don't try to .decode non-html things, default iso-lat-1 for non-html too 2019-09-09 16:17:26 +01:00
jesopo
b83f5d9e30 add flag to disable encoding detection 2019-09-09 14:59:08 +01:00
jesopo
5ef2b7af27 'str.split' -> 's.split' 2019-09-09 14:53:11 +01:00
jesopo
1df82c1cb2 still default to iso-latin-1 if no on-page or in-header content-type is present 2019-09-09 14:48:26 +01:00
jesopo
0a67659637 only look for <meta>-related tags when there are meta tags 2019-09-09 14:39:19 +01:00
jesopo
0a1077c5cd add explicit None return for _find_encoding (mypy) 2019-09-09 14:25:01 +01:00
jesopo
ff9c82bf67 change utils.http.request to best-effort detect on-page encoding
closes #113
2019-09-09 14:11:18 +01:00
jesopo
397cfa8e7e correctly qualify DeadlineExceededException namespace 2019-09-03 14:54:59 +01:00
jesopo
b7b2f31c1c use utils.deadline() in utils.http.request, not raw sigalrm 2019-09-02 15:50:21 +01:00
jesopo
9cc1ee98eb Pass the content of a webpage to HTTPParsingException 2019-09-02 13:27:44 +01:00
jesopo
408b89aeb7 use \S+ for url regex (for non-ascii chars), use url_sanitize to catch <> 2019-09-02 13:25:48 +01:00
jesopo
20042edfd9 Allow bypass of content-type check in utils.http.request 2019-08-05 15:41:02 +01:00
jesopo
d093027431 not all HTTP responses have content-type 2019-08-02 17:33:16 +01:00
jesopo
c19c6c0e14 asyncio.gather -> asyncio.wait (with timeout) 2019-07-08 14:50:11 +01:00
jesopo
469c725675 tell asyncio.gather which loop to use 2019-07-08 14:41:12 +01:00
jesopo
a1438abf66 close event loop when we're done with it (request_many()) 2019-07-08 13:59:48 +01:00