Safe Haskell | None |
---|---|
Language | Haskell98 |
HTTP downloader tailored for web-crawler needs.

- Handles all possible http-conduit exceptions and returns human-readable error messages.
- Handles some web server bugs (returning `deflate` data instead of `gzip`, invalid `gzip` encoding).
- Uses OpenSSL instead of the `tls` package (since `tls` doesn't handle all sites).
- Ignores invalid SSL certificates.
- Receives data in 32k chunks internally to reduce memory fragmentation on many parallel downloads.
- Download timeout.
- Total download size limit.
- Returns HTTP headers for subsequent redownloads and handles `Not modified` results.
- Can be used with an external DNS resolver (hsdns-cache, for example).
- Keep-alive connection pool (thanks to http-conduit).
Typical workflow in a crawler:

```haskell
withDnsCache $ \c -> withDownloader $ \d -> do
    ... -- got URL from queue
    ra <- resolveA c $ hostNameFromUrl url
    case ra of
        Left err -> ... -- uh oh, bad host
        Right ha -> do
            ... -- crawler politeness stuff (rate limits, queues)
            dr <- download d url (Just ha) opts
            case dr of
                DROK dat redownloadOptions ->
                    ... -- analyze data, save redownloadOpts for next download
                DRRedirect .. -> ...
                DRNotModified -> ...
                DRError e -> ...
```
It's highly recommended to use http://hackage.haskell.org/package/concurrent-dns-cache (preferably with a single resolver pointing to a locally running BIND) for DNS resolution, since the `getAddrInfo` used in http-conduit can be buggy and inefficient when it needs to resolve many hosts per second for a long time.
Synopsis
- urlGetContents :: String -> IO ByteString
- urlGetContentsPost :: String -> ByteString -> IO ByteString
- download :: Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult
- downloadG :: (Request -> ResourceT IO Request) -> Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO DownloadResult
- rawDownload :: (Request -> ResourceT IO Request) -> Downloader -> String -> Maybe HostAddress -> DownloadOptions -> IO (DownloadResult, Maybe RawDownloadResult)
- data DownloadResult
- data RawDownloadResult = RawDownloadResult {}
- type DownloadOptions = [String]
- data DownloaderSettings = DownloaderSettings {}
- data Downloader
- withDownloader :: (Downloader -> IO a) -> IO a
- withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a
- newDownloader :: DownloaderSettings -> IO Downloader
- postRequest :: ByteString -> Request -> Request
- sinkByteString :: MonadIO m => Int -> ConduitT ByteString Void m (Maybe ByteString)
Download operations
urlGetContents :: String -> IO ByteString Source #

Download a single URL with the default `DownloaderSettings`. Fails if the result is not `DROK`.
urlGetContentsPost :: String -> ByteString -> IO ByteString Source #

Post data and download a single URL with the default `DownloaderSettings`. Fails if the result is not `DROK`.
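A minimal usage sketch (the URLs and POST body below are placeholders, not part of the package):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import Network.HTTP.Conduit.Downloader (urlGetContents, urlGetContentsPost)

main :: IO ()
main = do
    -- Fetch a page with the default DownloaderSettings; fails unless DROK.
    body <- urlGetContents "http://example.com/"
    B.putStrLn (B.take 200 body)
    -- POST a small body to a placeholder endpoint and print the response.
    resp <- urlGetContentsPost "http://example.com/api" "key=value"
    B.putStrLn resp
```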
download Source #

:: Downloader | |
---|---|
-> String | URL |
-> Maybe HostAddress | Optional resolved HostAddress |
-> DownloadOptions | |
-> IO DownloadResult | |

Perform a download.
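As a hedged illustration of handling each `DownloadResult` constructor (the helper name `fetch` is made up for this sketch):

```haskell
import qualified Data.ByteString as B
import Network.HTTP.Conduit.Downloader

fetch :: Downloader -> String -> DownloadOptions -> IO ()
fetch d url opts = do
    -- No pre-resolved HostAddress; opts may carry If-None-Match /
    -- If-Modified-Since headers from a previous DROK result.
    r <- download d url Nothing opts
    case r of
        DROK body nextOpts -> do
            putStrLn $ "got " ++ show (B.length body) ++ " bytes"
            print nextOpts   -- keep these for the next download of this URL
        DRRedirect url' -> putStrLn $ "redirect to " ++ url'
        DRNotModified   -> putStrLn "not modified"
        DRError e       -> putStrLn $ "error: " ++ e
```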
post :: Downloader -> String -> Maybe HostAddress -> ByteString -> IO DownloadResult Source #
Perform HTTP POST.
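A one-line sketch (URL and payload are placeholders):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit.Downloader

postExample :: Downloader -> IO DownloadResult
postExample d = post d "http://example.com/api" Nothing "key=value"
```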
downloadG Source #

:: (Request -> ResourceT IO Request) | Function to modify the Request |
---|---|
-> Downloader | |
-> String | URL |
-> Maybe HostAddress | Optional resolved HostAddress |
-> DownloadOptions | |
-> IO DownloadResult | |

Generic version of download with the ability to modify the http-conduit `Request`.
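A sketch that injects an extra request header, assuming the `Request` and its `requestHeaders` field come from http-client's Network.HTTP.Client (header name and value are arbitrary examples):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Client (requestHeaders)
import Network.HTTP.Conduit.Downloader

downloadWithHeader :: Downloader -> String -> IO DownloadResult
downloadWithHeader d url = downloadG addHeader d url Nothing []
  where
    -- Prepend a header to whatever headers the downloader already set.
    addHeader r =
        return r { requestHeaders = ("Accept-Language", "en") : requestHeaders r }
```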
rawDownload Source #

:: (Request -> ResourceT IO Request) | Function to modify the Request |
---|---|
-> Downloader | |
-> String | URL |
-> Maybe HostAddress | Optional resolved HostAddress |
-> DownloadOptions | |
-> IO (DownloadResult, Maybe RawDownloadResult) | |

Even more generic version of download, which also returns a RawDownloadResult. The RawDownloadResult is optional since it cannot be determined on timeouts and connection errors.
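A small sketch, using `return` as the identity request modification:

```haskell
import Network.HTTP.Conduit.Downloader

rawFetch :: Downloader -> String -> IO ()
rawFetch d url = do
    (dr, mraw) <- rawDownload return d url Nothing []
    print dr
    -- mraw is Nothing on timeouts and connection errors.
    maybe (putStrLn "no raw result") print mraw
```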
data DownloadResult Source #

Result of a download operation.

Constructors

DROK ByteString DownloadOptions | Successful download with data and options for the next download. |
---|---|
DRRedirect String | Redirect URL |
DRError String | Error message |
DRNotModified | HTTP 304 Not Modified |
Instances

Eq DownloadResult Source #
Read DownloadResult Source #
Show DownloadResult Source #
data RawDownloadResult Source #

Result of a rawDownload operation.
Instances

Eq RawDownloadResult Source #
Show RawDownloadResult Source #
type DownloadOptions = [String] Source #
If-None-Match and/or If-Modified-Since headers.
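A sketch of the intended round trip: pass the options returned by a previous `DROK` back into the next download of the same URL (the helper name `pollOnce` is made up):

```haskell
import Network.HTTP.Conduit.Downloader

pollOnce :: Downloader -> String -> DownloadOptions -> IO DownloadOptions
pollOnce d url opts = do
    r <- download d url Nothing opts
    return $ case r of
        DROK _ opts'  -> opts'  -- content changed; remember the new validators
        DRNotModified -> opts   -- unchanged; keep the old validators
        _             -> []     -- redirect or error; start fresh next time
```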
Downloader
data DownloaderSettings Source #
Settings used in the downloader.

Constructors

DownloaderSettings
Instances

Default DownloaderSettings Source #
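For example, a sketch that starts from the defaults (assuming `def` comes from the data-default package's Data.Default, which provides the Default instance above):

```haskell
import Data.Default (def)
import Network.HTTP.Conduit.Downloader

main :: IO ()
main = withDownloaderSettings def $ \d -> do
    -- Adjust DownloaderSettings record fields on def as needed before this.
    r <- download d "http://example.com/" Nothing []
    print r
```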
data Downloader Source #
Keeps the http-conduit `Manager` and `DownloaderSettings`.
withDownloader :: (Downloader -> IO a) -> IO a Source #
Create a new `Downloader`, use it in the provided function, and then release it.
withDownloaderSettings :: DownloaderSettings -> (Downloader -> IO a) -> IO a Source #
Create a new `Downloader` with the provided settings, use it in the provided function, and then release it.
newDownloader :: DownloaderSettings -> IO Downloader Source #
Create a `Downloader` with the given settings.
Utils
postRequest :: ByteString -> Request -> Request Source #
Turn a `Request` into an HTTP POST request with the given body.
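For instance, it can be combined with downloadG's request-modification hook; a sketch (payload and URL are placeholders):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit.Downloader

postViaG :: Downloader -> String -> IO DownloadResult
postViaG d url = downloadG (return . postRequest "key=value") d url Nothing []
```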
sinkByteString :: MonadIO m => Int -> ConduitT ByteString Void m (Maybe ByteString) Source #
Sink data using 32k buffers to reduce memory fragmentation. Returns `Nothing` if too much data was downloaded.
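A standalone sketch feeding the sink from a list source; the `Int` argument is assumed here to be the size limit in bytes beyond which the sink returns `Nothing`:

```haskell
import Conduit (runConduit, yieldMany, (.|))
import qualified Data.ByteString as B
import Network.HTTP.Conduit.Downloader (sinkByteString)

collectLimited :: [B.ByteString] -> IO (Maybe B.ByteString)
collectLimited chunks =
    -- Give up (Nothing) if the accumulated data exceeds ~1 MB (assumed limit).
    runConduit $ yieldMany chunks .| sinkByteString (1024 * 1024)
```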