• 6. Design A Web Crawler



    title: Notes of System Design No.10 — Design a Web Crawler
    description: 'Design a Web Crawler'
    date: 2022-05-13 18:01:58
    tags: System Design
    categories:

    • System Design

    00. What is a Web Crawler?

    Q: For now let's just handle HTML pages, but your web crawler should be extensible so that in the future it can handle other types of media.

    A: And what types of protocols do we want to be able to crawl on the internet?

    Q: Let's just stick to HTTP for now, but again it should be extensible, so that maybe in the future it can handle FTP, HTTPS, and so on.

    A: Great. I'll pull up my screen and start writing down some of these requirements. So we know we're going to handle the HTTP protocol, and we're handling just HTML files for now, but ideally in the future we'd be able to expand to other types of files and other types of media. With that written down, maybe we can start thinking about some of the features we'd like to support in the web crawler. How does that sound to you?

    Q: Yeah, that sounds perfect.

    01. Functional Requirements

    A: Great. Some of the things we can think about:

    • Politeness / crawl rate: we want to monitor our crawl rate as we navigate the web, because in a lot of cases we don't want to over-use the resources of a given server or service.
    • DNS resolution: a DNS resolver will help us resolve websites faster.
    • Distributed crawling: we want to crawl efficiently, with no single point of failure that can take the system down.
    • Priority crawling: how do we decide which sites to crawl, and how do we rank them?
    • Duplicate detection: we shouldn't keep going over sites we've already seen, and on a given page we shouldn't cycle through the same page over and over because of self-references or the like.

    02. Non-Functional Requirements

    A: Given that we're processing a lot of data and going through millions of sites, we're going to want:

    • Scalability: the system has to keep up with the volume of data and the number of sites we're crawling.
    • Extensibility: we talked about duplicate detection and priority crawling, but there could be additional checks we can't anticipate now, like malware detection or other security concerns, and we want those to be easy to add alongside the existing features.
    • Freshness of pages: how do we continually re-crawl and renew the pages we've already seen?
    • Server-side rendering: as mentioned before, single-page applications have to be rendered on our side, on the server, before we can scan them; otherwise we won't be able to read them, because all we'd get is a JavaScript bundle.
    • Politeness: like we said before, this is about respecting the upper limit on how frequently we visit these websites, because we don't want to consume too many resources on their servers.

    So do all these goals seem okay for us to proceed with the system design itself, piece by piece?

    Q: Yeah, that sounds good to me.

    03. Assumptions

    A: Given this feature set, I'd like to start talking about the scale of the system, roughly how much we're going to need to handle.

    Starting from here: let's say the internet has about 900 million registered websites. Of those 900 million, we can estimate that roughly 60%, or around 500 million, are actually up and running; if they're up and running they're probably also crawlable, though some may have other issues.

    Given those 500 million sites, let's say each site has on average 100 pages. They'll vary anywhere from one or two pages up to a thousand or more for something really big like Facebook (Facebook is probably even bigger than that), but 100 is probably a sufficient average. Would you agree?

    Q: Yes, that sounds good.

    A: Awesome. So 500 million sites at 100 pages per site gives us a total of about 50 billion pages to crawl.

    Given 50 billion pages, we also need to factor in the size of each individual page, so let's estimate about 120 kilobytes per page. Since we're only crawling HTML, we mostly just need the HTML itself, plus in some cases the JavaScript and CSS, especially for single-page applications, which we'll need to render before we can see the whole site.

    So 50 billion pages times 120 kilobytes comes out to around 6 petabytes worth of storage.
    A: Given that, we can plan to store all of this, give or take some tolerance around the estimate. We'll have to put it somewhere, but maybe before we think about that: how can we optimize this? Because that's a lot of data.

    One optimization is to extract just the metadata and build some sort of annotated index, so we don't need the complete page stored for every single page. And in the cases where we do need the full page, we can compress all of these pages and only unzip one when we need to look at its contents. After zipping and compressing everything, we can estimate we'd get down to about half, which is roughly 3 petabytes of storage.

    So does all of that seem reasonable as far as scale?

    Q: Yeah, sounds good to me so far.
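    As a quick sanity check on those numbers, here is a back-of-the-envelope calculation. It is only a sketch; the 50% compression ratio is simply the assumption from the discussion above.

```python
# Rough capacity estimate for the crawl, using the figures discussed above.
sites = 500_000_000        # websites assumed up and running
pages_per_site = 100       # assumed average
page_size_kb = 120         # assumed average page size

total_pages = sites * pages_per_site            # 50 billion pages
raw_bytes = total_pages * page_size_kb * 1_000  # ~6e15 bytes
compressed_bytes = raw_bytes * 0.5              # assumed ~50% compression ratio

print(f"pages:      {total_pages:,}")                   # 50,000,000,000
print(f"raw:        {raw_bytes / 1e15:.1f} PB")         # 6.0 PB
print(f"compressed: {compressed_bytes / 1e15:.1f} PB")  # 3.0 PB
```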

    04. Define API

    05. High-Level Design

    A: Great. Probably the first thing you need in any web crawler is a set of URLs to start from; any system is going to need input URLs to be able to start the crawling process.

    Since we're crawling HTML sites, let's say our use case is a search engine, something like Google or Bing. In that case we might want to crawl the top 100 sites in every category on the internet (finance, entertainment, fashion, blogs, and so on), make a list of those, and compile them all into seed URLs. All of those URLs are essentially one document that we can feed into the next piece, which is the URL frontier.
    The URL frontier is essentially a data structure built from queues, which gives us prioritization over which URLs to crawl next. If you want, we could explore that further together; it's up to you.

    Q: I'd just be curious to know: what are you prioritizing on in the URL frontier?

    A: Great question. Priority here is determined by which URLs we want to crawl first, in other words which sites we crawl ahead of others. I'd imagine we'd prioritize based on traffic to the sites, or on other rankings; Google, for example, ranks pages partly by how many other URLs reference a given URL, so that could be one priority signal we use. Also, if a URL is really old and needs a refresh, and it's being hit a lot, that could raise its priority as well. How does that sound?

    Q: That all sounds really sensible.
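    To make the idea concrete, here is a minimal sketch of a priority-based frontier. It is an illustration rather than the exact design above; the scoring function is a made-up stand-in for signals like inbound-link counts and staleness.

```python
import heapq
import itertools
import time


class URLFrontier:
    """Minimal priority-queue sketch of a URL frontier.

    Lower score means crawled sooner. The scoring inputs (inbound links,
    time since last crawl) are placeholders for whatever ranking signals
    a real frontier would use.
    """

    def __init__(self):
        self._heap = []                    # entries: (score, tie_breaker, url)
        self._counter = itertools.count()  # keeps ordering stable for equal scores

    def add(self, url, inbound_links=0, last_crawled=None):
        days_stale = 365.0 if last_crawled is None else (time.time() - last_crawled) / 86400
        # More inbound links and staler pages mean higher priority (lower score).
        score = -(inbound_links + min(days_stale, 365.0))
        heapq.heappush(self._heap, (score, next(self._counter), url))

    def next_url(self):
        """Pop and return the highest-priority URL, or None if the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

    A production frontier would typically also keep per-host queues so politeness can be enforced alongside priority; this sketch only covers the priority half.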
    A: Okay, so we have this structure that prioritizes the URLs to crawl next and returns a URL. We take that URL and pass it to the next component, which is going to be a fetcher/renderer.

    This is essentially the piece of the process that handles the fetching and the rendering, which I guess makes sense given the name. We can have several of these running distributed, in parallel, and each of them can be implemented with threads or processes; Python, for example, has ways of building multi-threaded workers, so we could implement it like that. Does that sound all right?
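    For illustration, a minimal multi-threaded fetcher could look like the sketch below. It assumes the third-party requests library, and the bot name in the User-Agent header is a hypothetical placeholder that ties into the politeness discussion further down.

```python
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client; any HTTP library would work here

# Hypothetical crawler identity, relevant to the politeness discussion below.
HEADERS = {"User-Agent": "ExampleCrawlerBot/1.0"}


def fetch(url, timeout=10):
    """Download one page; return (url, html) or (url, None) on failure."""
    try:
        resp = requests.get(url, headers=HEADERS, timeout=timeout)
        resp.raise_for_status()
        return url, resp.text
    except requests.RequestException:
        return url, None


def fetch_many(urls, max_workers=32):
    """Fetch a batch of URLs in parallel on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```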
    Q: Yeah. So is the fetcher getting the content of the URL directly here?

    A: Yes. Each of these components does two things. The first is fetching, as you say: we need to get the content from the URL. Once a URL is passed over from the URL frontier, we could just immediately call the page and download it. But as we talked about before, we might end up getting routed through a bunch of DNS lookups or a bunch of IPs, so as an intermediate step to speed that up we can add a DNS resolver here. Just as you said, though, the fetcher is basically going to pull that page and whatever exists on it to be crawled.
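    As a toy illustration of that resolver, a small in-process DNS cache could look like the following. This is only a sketch: a real resolver would honor the TTLs returned in the DNS records and would likely run as a shared service across fetchers.

```python
import socket
import time

_DNS_CACHE = {}         # hostname -> (ip_address, expiry_time)
_DNS_TTL_SECONDS = 300  # assumed cache lifetime; real TTLs come from the DNS records


def resolve(hostname):
    """Resolve a hostname to an IP address, caching results between calls."""
    now = time.time()
    cached = _DNS_CACHE.get(hostname)
    if cached and cached[1] > now:
        return cached[0]
    ip = socket.gethostbyname(hostname)  # blocking standard-library lookup
    _DNS_CACHE[hostname] = (ip, now + _DNS_TTL_SECONDS)
    return ip
```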
    
    If it's just a standard HTML page, then the fetcher's work is probably done at that point. But if it's a single-page application, which is more and more common these days (a lot of sites use React, Angular, and other SPA frameworks), then we'll need some server-side rendering, the kind of thing frameworks like Gatsby or Next.js do, and handle that rendering on our side.
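    Gatsby and Next.js are frameworks for rendering sites you own; to render arbitrary third-party pages, a crawler more commonly drives a headless browser. A rough, hedged sketch using the third-party Playwright library (one option among several):

```python
from playwright.sync_api import sync_playwright  # third-party headless-browser library


def render(url, timeout_ms=15000):
    """Load a JavaScript-heavy page in a headless browser and return the rendered HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, timeout=timeout_ms)
        html = page.content()
        browser.close()
    return html
```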
    
    A: And, referring back to politeness: as we're going through these individual pages, it would probably be good to set the user agent to a crawler name, something that identifies us as a robot, since a lot of sites key off those names to optimize for, or direct, crawler traffic.

    Q: Gotcha. So that's a way of signaling to the website that your crawling traffic is coming from a bot rather than a human?

    A: Exactly. We don't want them to get angry with us and block our traffic, so that would probably be a good idea.
    A: So we have this DNS resolver and the fetcher/renderer. Now that we have all this content, we need to store it somewhere. Do you have a preference for a storage system, something like S3 or Bigtable or anything else?

    Q: Not necessarily, but I do think you probably want some balance of having persistent storage and also some short-term storage that processes can access efficiently.

    A: Yes, great point. I was thinking of the long-term storage, but I definitely think short-term storage is important too, so we can access the data faster; I'll add that here. I'll call the first one long-term storage and put in a placeholder like Amazon S3 buckets. And for the short-term storage I think Redis will work well as a cache; there are other caches, like Memcached, but I think Redis will be sufficient for this.
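    As a sketch of that split (the bucket name is hypothetical, and boto3 and redis are third-party clients that assume configured AWS credentials and a running Redis instance), storing a page might look like this:

```python
import gzip
import hashlib
import json
import time

import boto3  # AWS SDK for the long-term S3 buckets
import redis  # client for the short-term Redis cache

s3 = boto3.client("s3")
cache = redis.Redis(host="localhost", port=6379)

BUCKET = "crawler-pages"  # hypothetical bucket name


def store_page(url, html):
    """Compress a page into long-term storage and keep hot metadata in Redis."""
    key = hashlib.sha256(url.encode()).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=key, Body=gzip.compress(html.encode()))
    metadata = {"url": url, "size": len(html), "crawled_at": time.time()}
    cache.setex(f"page:{key}", 24 * 3600, json.dumps(metadata))  # one-day TTL
    return key
```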
    A: Okay, great. Now we can read and write: we're storing the data in these long-term buckets to access later, and we have the short-term storage in Redis so that we can quickly read and process the data.

    So how are we going to read and process the data? From the fetcher we can hand off to another process, and I imagine we'll need some sort of queue for all of these responses, since we're going through so many of them; if we tried to do it synchronously, it would slow the system down a ton. So we can queue up everything sent from here: as soon as we store a page, we also put an entry into a response queue.
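    A minimal sketch of that hand-off, using an in-process queue as a stand-in for whatever distributed queue a real deployment would use:

```python
import queue
import threading

# In-process stand-in for a distributed queue (Kafka, SQS, and so on).
response_queue = queue.Queue()


def enqueue_response(url, html):
    """Called by the fetcher right after a page has been stored."""
    response_queue.put({"url": url, "html": html})


def worker(handle):
    """Consume responses and hand each one to the downstream services."""
    while True:
        item = response_queue.get()
        handle(item)  # e.g. URL extraction, deduplication, malware checks
        response_queue.task_done()


def start_workers(handle, n=4):
    """Start n daemon threads that drain the response queue."""
    for _ in range(n):
        threading.Thread(target=worker, args=(handle,), daemon=True).start()
```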
    A: The two services I can immediately think of to handle that queue, which we've talked about, are the main one, a URL extractor, and a second one, a deduplication service. Would you like me to talk about any additional services?

    Q: This looks good to me for now.

    A: So, given these, we have the URL extractor, which... actually, you know what, I'll start with the deduplication service, because it's pretty simple. We have all of these URLs sitting in the short-term storage, and as we go through them we check whether a URL is already a duplicate of something in the short-term storage; if it is, we don't extract it again.
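    As a sketch, that check against short-term storage can be as simple as a Redis set-membership test (using the same hypothetical Redis instance as above; SADD returns 1 only the first time a member is added):

```python
import redis  # third-party Redis client

cache = redis.Redis(host="localhost", port=6379)


def is_new_url(url):
    """True the first time we see a URL in this crawl pass, False afterwards."""
    return cache.sadd("crawl:seen_urls", url) == 1
```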
    
    
    
    A: As for the URL extractor, this is where we pull URLs out of the page so that the crawler can continue diving further into the same website, or follow links out to other websites. Most likely we'll traverse with something like depth-first search or breadth-first search; in crawlers I've preferred breadth-first search, so that we exhaust a given page before moving on to the next one. In addition, if the extracted URLs are all from the same domain, we can group them under that one domain and just keep their relative paths.
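    A minimal sketch of that extraction step, using only the standard library's HTML parser (in practice a library like BeautifulSoup would be the more common choice); the fetch argument is any function that returns a page's HTML or None:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for every link found on the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl_bfs(seed_urls, fetch, max_pages=1000):
    """Breadth-first traversal: finish a page's links before going deeper."""
    queue, seen = deque(seed_urls), set(seed_urls)
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        html = fetch(url)  # fetch(url) returns the page HTML, or None on failure
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
```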
    
    
    
    A: Moving back to our list of goals, we have extensibility as an important point, and right here is where we can allow for it: from the response queue we could attach any additional services we wanted. For example, we talked about malware detection; we probably don't want to download viruses or anything else off the internet that's going to mess up our system, and a check like that could hang off the response queue as well.
    
    
    
    
    A: In addition to that, we take the URLs we've now extracted and start reprocessing them, because this is essentially a cycle that never ends until we've crawled the whole internet, I guess.

    Given this URL extractor, and since we specified that we only want HTML pages, this is probably an important point to add in a URL filter. With the URL filter we clean up and normalize all of these URLs, and it gives us a second check on what we let through. Once everything has been filtered, we need to do one final check before we load all of these URLs back in.
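    A rough sketch of what that filter and normalization step might do; the list of skipped extensions is an assumption, and a real filter would also look at the Content-Type the fetcher observed:

```python
from urllib.parse import urldefrag, urlparse

# Extensions we assume are not HTML pages; adjust to taste.
SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".css", ".js")


def normalize(url):
    """Normalize a URL: lowercase scheme and host, drop fragments, trim trailing slash."""
    url, _ = urldefrag(url)  # remove any '#fragment'
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def keep(url):
    """Return True if the URL looks like an HTML page we want to crawl."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    return not parts.path.lower().endswith(SKIP_EXTENSIONS)
```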
    
    A: In this given pass we've removed duplicates, but what happens with duplicates we've already seen previously, pages we've already crawled? And if we do want to crawl a page again, how do we check its freshness and then save the new version?

    For that, I guess we'll want a URL loader. So what does this URL loader do? Essentially it checks whether we've already crawled the URL. We'll need to build some sort of fast lookup for that, because it would take linear time to search the whole database (or whatever storage system we're using), and at the scale we have here it would take far too long to search through all of it just to see whether we already have a URL.

    In this case I would probably use a Bloom filter data structure. The drawback is that it's not always correct; it sacrifices correctness for speed. That means we could end up missing some URLs, but I think that's okay here, because we'll probably pick those URLs up in another pass. Does that seem all right as a trade-off in this case?
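    For concreteness, a tiny Bloom filter can be sketched with nothing but the standard library. The bit-array size and hash count below are arbitrary placeholders; a real deployment would size them for the expected number of URLs and the false-positive rate it can tolerate.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter sketch: fast membership checks with some false positives."""

    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()


def should_enqueue(url):
    """Skip URLs the filter believes we have crawled already.

    A false positive only means an occasional URL is skipped until a later
    pass, which matches the trade-off discussed above.
    """
    if url in seen:
        return False
    seen.add(url)
    return True
```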
    A: Given that, that's pretty much the flow... well, actually, we still need to store all of these URLs somewhere. We could use something like NoSQL, but in this case I would probably choose a distributed file system to store all of the different URLs. And as I talked about before, we could use a strategy where the domain name sits at the top level of the file name, and then all of the relative URLs tied to that domain are stored underneath it. Does that seem reasonable in this case?

    Q: Yes, that sounds reasonable.
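    One way that domain-grouped layout could look; the path scheme and hashing below are illustrative assumptions rather than a prescribed format:

```python
import hashlib
from urllib.parse import urlparse


def storage_path(url, root="crawled"):
    """Map a URL to <root>/<domain>/<hash of relative path>, grouping URLs by domain."""
    parts = urlparse(url)
    relative = parts.path or "/"
    if parts.query:
        relative += "?" + parts.query
    digest = hashlib.sha256(relative.encode()).hexdigest()[:16]
    return f"{root}/{parts.netloc}/{digest}"


# Example: storage_path("https://example.com/a/b?x=1") -> "crawled/example.com/<hash>"
```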
    A: So now we have all of these URLs saved and crawled, and if we've already crawled a URL we don't pass it back to the beginning of the pipeline, the URL frontier, again. In some cases, though, when we've already come across a URL we will still want to update it: we can add timestamps recording when we last crawled it, to determine its freshness, and the URL frontier can then use those timestamps to decide how URLs get ranked and prioritized for being refreshed or re-crawled.
    Q: You mentioned before that politeness was one of your design goals. How do you ensure that your crawler is respecting each individual website's limitations on how much, and where, you can crawl?
    
    A: Yeah, definitely. Going back to the fetcher/renderer, we talked about defining the user agent, but we could be even more polite than that: we can check whether the website has a robots.txt file, which implements the robots exclusion protocol. Basically, it defines which parts of the website we should not crawl, and it can also specify a delay time. I think I mentioned the crawl rate before, but thank you for bringing it up again, because it's an important factor in being polite and I didn't go into it here; in that case we want to respect whatever crawl rate the site has defined for us in that file.
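    A sketch of that check using only the standard library; the bot name matches the hypothetical one from the fetcher sketch earlier, and a real crawler would also refresh each host's robots.txt periodically rather than caching it forever:

```python
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleCrawlerBot/1.0"  # hypothetical bot name, as in the fetcher sketch
_parsers = {}                         # one cached parser per host


def _robots_for(url):
    host = urlparse(url).netloc
    if host not in _parsers:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()  # network call; if it fails, can_fetch() conservatively returns False
        except OSError:
            pass
        _parsers[host] = rp
    return _parsers[host]


def allowed(url):
    """True if robots.txt permits fetching this URL with our user agent."""
    return _robots_for(url).can_fetch(USER_AGENT, url)


def crawl_delay(url, default=1.0):
    """Seconds to wait between requests to this host (Crawl-delay, if specified)."""
    delay = _robots_for(url).crawl_delay(USER_AGENT)
    return delay if delay is not None else default
```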


    06. Low-Level Design

    07. Dive Deep

  • Original source: https://blog.csdn.net/vjhghjghj/article/details/127935783