Stop overriding default port configurations by distracteddev · Pull Request #40 · simplecrawler/simplecrawler · GitHub
Skip to content
This repository was archived by the owner on Mar 7, 2021. It is now read-only.

Stop overriding default port configurations#40

Merged
cgiffard merged 1 commit into
simplecrawler:masterfrom
distracteddev:master
Jun 3, 2013
Merged

Stop overriding default port configurations#40
cgiffard merged 1 commit into
simplecrawler:masterfrom
distracteddev:master

Conversation

@distracteddev

Copy link
Copy Markdown
Contributor

I was getting internal client errors because the Queue was being populated with port:80 queueItems but the protocol was listed as https. This caused a call to https.get({port:80}) which throws an OpenSSH error:

[Error: 140735148347776:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:../deps/openssl/openssl/ssl/s23_clnt.c:766:

@cgiffard

cgiffard commented Jun 3, 2013

Copy link
Copy Markdown
Member

cgiffard added a commit that referenced this pull request Jun 3, 2013
Stop overriding default port configurations
@cgiffard cgiffard merged commit 845faca into simplecrawler:master Jun 3, 2013
@cgiffard

cgiffard commented Jun 3, 2013

Copy link
Copy Markdown
Member

I'll push to npm soonish, but I'm trying to resolve an error/semantic problem (#39) before I do so. :)

@distracteddev

Copy link
Copy Markdown
Contributor Author

No worries and thanks for the great crawler :) Autodiscovery saved me a ton of time.

One issue I'm still having however is when I try to discover 50+ links the state of the app ends up in a situation where

if (crawler._openRequests >= crawler.maxConcurrency) 

always evaluates to true and the openRequests never timeout (even after setting the timeout option to something small like 5 or 10 seconds) and the app is just left spinning in cycles waiting on requests that never seem to end.

I've tried to fix it for hours to no avail. As such, I was thinking of porting the Queue to use Async's built in queue implementation to handle making and receiving requests. Any thoughts on this?

@cgiffard

cgiffard commented Jun 3, 2013

Copy link
Copy Markdown
Member

(BTW I totally forgot to tell you, but 0.2.7 is on npm now.:))

In regards to that issue, perhaps I need to be tighter with the timeouts. I'm pretty sure the problem does not relate to the queue, and I'd like it to retain a relatively general interface so that a queue can be implemented, for example, in Redis, and shared between multiple machines running the same crawl between then.

I'll have a look into whether there's a simple way to fix this problem for you - that's most definitely 100% a bug. :)

@distracteddev

Copy link
Copy Markdown
Contributor Author

I looked into it more and I think the Queue is functioning properly, its just that some requests were taking 150+ seconds to respond. I also realized you aren't using the timeout anywhere so I added:

// Around Line 700 of crawler.js
clientRequest.setTimeout(crawler.timeout, function () {
  console.log('TIMEOUT REACHED', queueItem.url);
  clientRequest.abort();
})

and that fixed my problem right up once I set the timeout to 10 seconds.

@cgiffard

cgiffard commented Jun 4, 2013

Copy link
Copy Markdown
Member

Sure, the missing timer is the bug. :)

I've got a bit of stuff on right now but I'll see if I can get to this later today. Thanks!

@cgiffard

cgiffard commented Jun 4, 2013

Copy link
Copy Markdown
Member

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants