68 commits

Author SHA1 Message Date
jesopo
ee6360be22 don't check already-read data when checking for too-large requests
this check was here because the first read returns empty when the bytes so far
are an invalid sequence for e.g. gzip and more data is needed. the second read
always returns data (not decoded), so regardless of what the already-read data
is, the second read is the only criterion we need.
2019-09-17 17:33:23 +01:00
jesopo
1ac7f2697e log which URL caused an error in request_many 2019-09-17 17:09:19 +01:00
jesopo
98545a9fb4 only decode content-types in DECODE_CONTENT_TYPES 2019-09-17 16:12:03 +01:00
jesopo
8ca0d30fef Response.__init__() needs encoding now 2019-09-17 14:11:12 +01:00
jesopo
b7dd78ef1a restore 5 second (instead of default 10) deadline for http.request 2019-09-17 13:44:14 +01:00
jesopo
94c3ff962b use utils.deadline_process() in utils.http._request() so background threads can
call _request()
2019-09-17 13:41:11 +01:00
jesopo
47735421b8 add json_body arg to Request to json-encode body, only return from body if
not null
2019-09-16 10:57:18 +01:00
jesopo
77f50187c5 allow Requests to specify a useragent 2019-09-12 10:41:50 +01:00
jesopo
9d6a3982ed add a helper utils.http.Client static object 2019-09-11 17:53:49 +01:00
jesopo
51dc26d113 add proxy to Request objects 2019-09-11 17:53:37 +01:00
jesopo
4a97c9eb0d refactor utils.http.requests to support a Request object 2019-09-11 17:44:07 +01:00
jesopo
8f8cf92ae2 automatically decode certain http content types 2019-09-11 15:28:13 +01:00
jesopo
a9b106c6be Don't try to .decode non-html things, default iso-lat-1 for non-html too 2019-09-09 16:17:26 +01:00
jesopo
b83f5d9e30 add flag to disable encoding detection 2019-09-09 14:59:08 +01:00
jesopo
5ef2b7af27 'str.split' -> 's.split' 2019-09-09 14:53:11 +01:00
jesopo
1df82c1cb2 still default to iso-latin-1 if no on-page or in-header content-type is present 2019-09-09 14:48:26 +01:00
jesopo
0a67659637 only look for <meta>-related tags when there are meta tags 2019-09-09 14:39:19 +01:00
jesopo
0a1077c5cd add explicit None return for _find_encoding (mypy) 2019-09-09 14:25:01 +01:00
jesopo
ff9c82bf67 change utils.http.request to best-effort detect on-page encoding
closes #113
2019-09-09 14:11:18 +01:00
jesopo
397cfa8e7e correctly qualify DeadlineExceededException namespace 2019-09-03 14:54:59 +01:00
jesopo
b7b2f31c1c use utils.deadline() in utils.http.request, not raw sigalrm 2019-09-02 15:50:21 +01:00
jesopo
9cc1ee98eb Pass the content of a webpage to HTTPParsingException 2019-09-02 13:27:44 +01:00
jesopo
408b89aeb7 use \S+ for url regex (for non-ascii chars), use url_sanitize to catch <> 2019-09-02 13:25:48 +01:00
jesopo
20042edfd9 Allow bypass of content-type check in utils.http.request 2019-08-05 15:41:02 +01:00
jesopo
d093027431 not all HTTP responses have content-type 2019-08-02 17:33:16 +01:00
jesopo
c19c6c0e14 asyncio.gather -> asyncio.wait (with timeout) 2019-07-08 14:50:11 +01:00
jesopo
469c725675 tell asyncio.gather which loop to use 2019-07-08 14:41:12 +01:00
jesopo
a1438abf66 close event loop when we're done with it (request_many()) 2019-07-08 13:59:48 +01:00
jesopo
81c7af8ab5 Don't try/except async http exceptions 2019-07-08 13:51:02 +01:00
jesopo
ee0ec0eca1 switch request_many() to use asyncio.gather 2019-07-08 13:46:27 +01:00
jesopo
b62ba469d7 catch async exceptions in utils.http.request_many() 2019-07-08 13:18:59 +01:00
jesopo
078681eddf add missing scheme in utils.http.sanitise_url, use in rss.py 2019-07-08 12:54:06 +01:00
jesopo
ecb8364d0d switch to using asyncio's event loop 2019-07-08 12:45:10 +01:00
jesopo
15e143fcff implement utils.http.request_many as a tornado ioloop yield 2019-07-08 11:43:09 +01:00
jesopo
637067c62c url_validate() -> url_sanitise() 2019-07-02 14:15:49 +01:00
jesopo
534854127b Add utils.http.url_validate() for best-effort url tidying 2019-07-02 14:10:18 +01:00
jesopo
f9eb017466 message arg for HTTPWrongContentTypeException/HTTPParsingException 2019-06-28 23:01:21 +01:00
jesopo
97810db8df Give descriptions to utils.http.HTTPException subclasses 2019-06-27 18:28:08 +01:00
jesopo
16d331dd43 add allow_redirects kwarg to utils.http.request() 2019-06-26 17:53:16 +01:00
jesopo
a802e66dcf Defer decoding http payload bytestring until after checking ContentType 2019-06-04 13:47:03 +01:00
jesopo
0be9046669 Pass str object to BeautifulSoup, not bytes. closes #56 2019-05-28 10:22:35 +01:00
Patrick Nappa
2c344c9ddd forgot the beautiful % 2019-05-03 13:50:51 +10:00
Patrick Nappa
471c11e229 ensure that non-url characters not separated by whitespace aren't consumed 2019-05-03 13:43:08 +10:00
jesopo
bdcb4b5db2 Add missing ":" 2019-04-25 17:50:41 +01:00
jesopo
1240b154cb Support interfaces that don't have AF_INET and/or AF_INET6 2019-04-25 17:48:51 +01:00
jesopo
7643a962bd Refuse to get the title for any url that points locally 2019-04-25 15:58:58 +01:00
jesopo
dffee4d223 Move REGEX_URL out of isgd.py and title.py in to utils.http 2019-04-24 15:46:54 +01:00
jesopo
197ae2e053 Raise a specific exception in utils.http.request for "wrong content type" 2019-02-28 23:28:45 +00:00
jesopo
846b881e52 Throw ValueError when utils.http.request tries to soup non-html/xml data 2019-02-27 15:16:08 +00:00
jesopo
cfaf6864fc Don't try to parse non-html/xml stuff with BeautifulSoup 2019-02-26 11:18:50 +00:00