
Scrapy problems with requests



By : Jorge Bolpe
Date : October 18 2020, 06:10 PM
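I'm trying to scrape all possible combinations of outputs of 5 dropdowns (each dropdown acting as one level of tree depth) and build a generic tree data structure out of them.
Answer: use yield request instead of return request. A callback that returns hands Scrapy a single request and exits, while a callback that yields can emit one request per dropdown option.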
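The difference between the two keywords can be seen without Scrapy at all. The sketch below uses plain tuples as stand-ins for scrapy.Request objects; the function names and the option list are made up for illustration.

```python
def parse_with_return(options):
    """Callback written with return: stops at the first option."""
    for option in options:
        return ("request", option)  # exits the function immediately


def parse_with_yield(options):
    """Callback written with yield: produces one request per option."""
    for option in options:
        yield ("request", option)  # the loop keeps going after each yield


options = ["a", "b", "c"]  # hypothetical values of one dropdown
print(parse_with_return(options))
print(list(parse_with_yield(options)))
```

In a real spider each tuple would be a scrapy.Request whose callback handles the next dropdown level, so only the yield version reaches every branch of the tree.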
Add a delay after 500 requests scrapy



By : shfz
Date : March 29 2020, 07:55 AM
You can look into the AutoThrottle extension. It does not give you tight control over the delays; instead it uses its own algorithm to slow the spider down, adjusting the delay on the fly based on the response time and the number of concurrent requests.
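As a rough starting point, AutoThrottle is switched on in settings.py. The setting names below are Scrapy's own; the values are only illustrative and would need tuning for the target site.

```python
# settings.py (sketch): enable AutoThrottle with illustrative values.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response
```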
If you need more control over the delays at certain stages of the scraping process, you might need a custom middleware or a custom extension (similar to AutoThrottle; its source is a useful reference).
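One possible shape for such a custom downloader middleware is sketched below. It is not taken from the answer: the class name and the PAUSE_EVERY / PAUSE_SECONDS settings are hypothetical, and it assumes that pausing the engine (engine.pause() and engine.unpause()) is an acceptable way to insert the delay. Requests already handed to the downloader may still complete during the pause.

```python
class PauseEveryNRequestsMiddleware:
    """Downloader middleware sketch: pause the engine after every N requests."""

    def __init__(self, crawler, every=500, pause_seconds=60.0):
        self.crawler = crawler
        self.every = every
        self.pause_seconds = pause_seconds
        self.seen = 0  # number of requests that passed through so far

    @classmethod
    def from_crawler(cls, crawler):
        # PAUSE_EVERY and PAUSE_SECONDS are hypothetical setting names.
        return cls(
            crawler,
            every=crawler.settings.getint("PAUSE_EVERY", 500),
            pause_seconds=crawler.settings.getfloat("PAUSE_SECONDS", 60.0),
        )

    def process_request(self, request, spider):
        self.seen += 1
        if self.seen % self.every == 0:
            self._pause()
        return None  # let the request continue to the downloader

    def _pause(self):
        # Imported lazily so the class can be imported without Twisted.
        from twisted.internet import reactor

        self.crawler.engine.pause()
        reactor.callLater(self.pause_seconds, self.crawler.engine.unpause)
```

The middleware would still have to be enabled through DOWNLOADER_MIDDLEWARES in the project settings.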
Haskell Yesod - CORS problems with browsers OPTIONS requests when doing POST requests



By : Siddhant Mohapatra
Date : March 29 2020, 07:55 AM
I believe the issue is that simpleCors is built on simpleCorsResourcePolicy, which only covers simpleMethods, and simpleMethods does not include OPTIONS.
You can fix this by using the same building blocks to roll whatever middleware you need.
code :
{-# LANGUAGE OverloadedStrings #-}

module Middlewares where

import Network.Wai                       (Middleware)
import Network.Wai.Middleware.AddHeaders (addHeaders)
import Network.Wai.Middleware.Cors       (CorsResourcePolicy(..), cors)

-- | @x-csrf-token@ and @authorization@ allowance.
-- The following header will be set: @Access-Control-Allow-Headers: x-csrf-token,authorization@.
allowCsrf :: Middleware
allowCsrf = addHeaders [("Access-Control-Allow-Headers", "x-csrf-token,authorization")]

-- | CORS middleware configured with 'appCorsResourcePolicy'.
corsified :: Middleware
corsified = cors (const $ Just appCorsResourcePolicy)

-- | Cors resource policy to be used with 'corsified' middleware.
--
-- This policy will set the following:
--
-- * RequestHeaders: @Authorization, Content-Type@
-- * MethodsAllowed: @OPTIONS, GET, PUT, POST@
appCorsResourcePolicy :: CorsResourcePolicy
appCorsResourcePolicy = CorsResourcePolicy {
    corsOrigins        = Nothing
  , corsMethods        = ["OPTIONS", "GET", "PUT", "POST"]
  , corsRequestHeaders = ["Authorization", "Content-Type"]
  , corsExposedHeaders = Nothing
  , corsMaxAge         = Nothing
  , corsVaryOrigin     = False
  , corsRequireOrigin  = False
  , corsIgnoreFailures = False
}
-- Compose the middlewares when running the application (e.g. in main):
run port $ logger . allowCsrf . corsified $ app cfg
Deferred requests in scrapy



By : grey-slate
Date : March 29 2020, 07:55 AM
I finally found an answer in an old PR:
code :
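from twisted.internet import reactor

def parse(self, response):
    req = scrapy.Request(...)
    delay = 0  # seconds to wait before scheduling the request
    reactor.callLater(delay, self.crawler.engine.schedule, request=req, spider=self)
The spider may be closed as idle before the delayed request is scheduled. A middleware that raises DontCloseSpider on the spider_idle signal keeps it alive: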
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class ImmortalSpiderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_idle, signal=signals.spider_idle)
        return s

    @classmethod
    def spider_idle(cls, spider):
        raise DontCloseSpider()
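A more complete variant keeps the spider alive only while delayed requests are pending, and delays any request that carries a delay_request meta key:
from weakref import WeakKeyDictionary

from scrapy import signals
from scrapy.exceptions import DontCloseSpider, IgnoreRequest
from twisted.internet import reactor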

class DelayedRequestsMiddleware(object):
    requests = WeakKeyDictionary()

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    @classmethod
    def spider_idle(cls, spider):
        if cls.requests.get(spider):
            spider.log("delayed requests pending, not closing spider")
            raise DontCloseSpider()

    def process_request(self, request, spider):
        delay = request.meta.pop('delay_request', None)
        if delay:
            self.requests.setdefault(spider, 0)
            self.requests[spider] += 1
            reactor.callLater(delay, self.schedule_request, request.copy(),
                              spider)
            raise IgnoreRequest()

    def schedule_request(self, request, spider):
        spider.crawler.engine.schedule(request, spider)
        self.requests[spider] -= 1
It is used from a callback like this (the delay is in seconds):
yield Request(..., meta={'delay_request': 5})
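Because the class implements process_request, it has to be registered as a downloader middleware before the delay_request key has any effect. A minimal settings.py sketch follows; the module path myproject.middlewares is a hypothetical placeholder, and 543 is just a conventional mid-range order value.

```python
# settings.py (sketch): "myproject.middlewares" is a hypothetical module path;
# replace it with wherever DelayedRequestsMiddleware actually lives.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DelayedRequestsMiddleware": 543,
}
```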
Scrapy + selenium requests twice for each url



By : Frank Anushka Sachit
Date : March 29 2020, 07:55 AM
Here is a trick that can help with this problem: create a web service that wraps Selenium and run it locally.
code :
from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")

            # options= replaces the deprecated chrome_options= keyword
            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])

        self.driver.get(url)

        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)
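Then point the spider's requests at the web service, passing the target URL as a query parameter:
import scrapy
from urllib.parse import quote  # Python 3 location of urllib.quote


class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['ebay.com']
    urls = [
        'http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40',
    ]

    def start_requests(self):
        for url in self.urls:
            # Send the request to the local Selenium service instead of the site
            url = 'http://127.0.0.1:5000/?url={}'.format(quote(url))
            yield scrapy.Request(url)

    def parse(self, response):
        yield {
            # .get() returns the first match as a string instead of a selector
            'field': response.xpath('//td[@class="pagn-next"]/a').get(),
        }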
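The URL wrapping step can be checked in isolation with the standard library. In the sketch below, via_selenium is a hypothetical helper name, the service address is the Flask default assumed by the answer, and the eBay URL is shortened for readability.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed address of the local Flask service from the answer above.
SERVICE = "http://127.0.0.1:5000/"


def via_selenium(target_url, service=SERVICE):
    """Return the service URL that asks Selenium to render target_url."""
    return service + "?" + urlencode({"url": target_url})


wrapped = via_selenium("http://www.ebay.com/sch/i.html?_nkw=python&_sacat=0")
print(wrapped)
# The service reads the original URL back from the "url" query parameter.
print(parse_qs(urlsplit(wrapped).query)["url"][0])
```

Encoding the target URL matters because its own query string would otherwise be parsed as extra parameters of the service request.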
Throttle Requests in Scrapy



By : user3074153
Date : March 29 2020, 07:55 AM
I figured this out. Since I am traversing in keyed order, expecting a key to eventually not exist, I need to configure Scrapy to work in FIFO order instead of the default LIFO order in settings.py:
code :
# The queue classes live in scrapy.squeues since Scrapy 1.0 (scrapy.squeue in older releases)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
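To see why the queue discipline matters for a keyed traversal, here is a toy model using a plain deque. It is not Scrapy's scheduler (which also takes request priority into account); crawl_order and the key list are made up for illustration.

```python
from collections import deque


def crawl_order(keys, fifo):
    """Return the order in which queued keys would be taken off the queue."""
    queue = deque(keys)
    order = []
    while queue:
        # FIFO takes the oldest entry, LIFO takes the newest one.
        order.append(queue.popleft() if fifo else queue.pop())
    return order


keys = [1, 2, 3, 4, 5]  # hypothetical keys, queued in ascending order
print(crawl_order(keys, fifo=True))
print(crawl_order(keys, fifo=False))
```

With LIFO the highest queued key is visited first, so a "key does not exist" response no longer signals the end of the sequence; FIFO preserves the order in which the keys were queued.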