# Digikey 1M Part Information Scrape Summary

## How much time did it take to complete the scrape?

Estimating from the Datadog logs, it took almost exactly 72 hours (3 days) for the scrape to finish.

## How many items did we return to PIM from the scraper?

In that time, we returned 870490 items from the scraper to PIM. This includes products which had offers but no part information as well as products which had both. There are certainly some duplicates in here. For example, if we scraped `BAV99` from NXP and `BAV99` from Vishay, we would have scraped both items each time we searched for that MPN. This is a consequence of not being able to match MFRs on the scraper side.

## How many Zyte requests did we use?

In that period, we used 4.58 million Zyte requests. The estimate was 5 million.

## How many part informations did we create/update?

I used this query to gather the information:

```
select count(*)
from part_information
where data_source_id = 6
  and updated_at > '2022-03-28 14:20:00';
```

We created or updated 435071 part informations. This is almost exactly half of the items we returned from the scraper to PIM. There are a few things that could explain why there are fewer part informations than items scraped:

- Parts which we scraped multiple times because they share an MPN but not an MFR.
- Returned items which had offers but no part information.
- Errors on the PIM side (I don't think this is the case because our logging would certainly show an issue of that magnitude).

## Which SKUs were not found?

In the million part list there were many SKUs which did not exist in the product table. The list is too large to paste here, but a text file with the SKUs can be found at the following link: https://sourceabilityinc-my.sharepoint.com/:t:/g/personal/andrew_morrison_sourceability_com/EYZ6F7X_jo9PridtPeiptGkBKuYF8Bj1-m6xtdlV2DwrDg?e=IKVSbe

## What about products we have in our system for which we didn't get any part information?
I spent a couple of hours looking at 10 random products which exist in our database but did not return any part information. The findings can be seen in this Excel sheet: https://sourceabilityinc-my.sharepoint.com/:x:/g/personal/andrew_morrison_sourceability_com/EVBo9ysFe_dFqk5wIWy4B2wBQBrNi2pFnw6hspfiRK3Ftw?e=t9s4B8

I think more time is required if we want to do this check properly. If we really want to dive into this, I think a spike would be best.

## How many times were we banned?

According to Datadog we received 30,100 bad responses.

## How many of each parameter did we get?

- description - 433455
- image - 170877
- datasheet - 406282
- category - 0 (we need to start storing raw categories in part information)
- specs - 435071
- manufacturerLeadTime - 318689
- alternates - 210432
- lifecycleStatus - 431991
- rohs - 0 (something could be wrong with our Digikey EU RoHS mapper)

## What errors were thrown during the scrape?

The errors thrown during the scrape can be seen here: https://sourceability.sentry.io/issues/?project=5557174

## What actions are we going to take from those errors?

There were three errors that happened more than twice:

https://sourceability.sentry.io/issues/4040652200/?project=5557174&query=is%3Aunresolved&referrer=issue-stream&stream_index=4

This error means that the returned currency was not USD. It is good that we caught this error, but we should also check the instances where it was caught to see if we are throwing it incorrectly. I checked several instances and did not find any where it was thrown incorrectly, so I think that Digikey sometimes returns the wrong currency and the error is being thrown correctly.

https://sourceability.sentry.io/issues/4043204309/?project=5557174&query=is%3Aunresolved&referrer=issue-stream&stream_index=2

This error means that the environmental specifications table did not exist. We were not counting on this happening because almost all products have it.
This task is to fix it: https://sourceability.atlassian.net/browse/CAT-3011

https://sourceability.sentry.io/issues/3910721718/?project=5557174&query=is%3Aunresolved&referrer=issue-stream&stream_index=0

This error means that the alternates API was either not responding or banning us. I don't think there is anything we can do about this one.

## Did we learn anything important?

We found out that Digikey RoHS values are not making their way into the part information table. We should probably store these values raw and map them later, but for now this ticket should fix the problem: https://sourceability.atlassian.net/browse/CAT-3029

## What could we do differently next time?

Michael would prefer that we not use so many requests for now. Right now we use 4 requests per search if multiple matches are not found. He thinks that, for now, we do not need the alternates API request or the updated pricing API response. If we removed the pricing API call, we would have to filter out the offers when doing a part information scrape, because they might have incorrect pricing. Here is the discussion: https://sourceability.slack.com/archives/C02BU0TAP7G/p1680013440243309
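
As a rough way to size the request savings discussed above, the sketch below scales the observed Zyte usage by the number of requests per search. It assumes every search used the full 4 requests (the report only says this holds when multiple matches are not found) and that the alternates and pricing calls cost one request each, so the projected totals are approximations. The function and constant names are illustrative and not part of the scraper.

```python
# Rough request-budget estimate for a future scrape of the same size.
# Assumptions (not confirmed by the scrape data):
#   - every search used exactly REQUESTS_PER_SEARCH requests
#   - the alternates call and the pricing call cost one request each

ZYTE_REQUESTS_USED = 4_580_000  # reported Zyte usage for this scrape
REQUESTS_PER_SEARCH = 4         # current requests per search


def projected_requests(used: int, current_per_search: int, new_per_search: int) -> int:
    """Scale observed usage to a different per-search request count."""
    implied_searches = used / current_per_search
    return round(implied_searches * new_per_search)


if __name__ == "__main__":
    # Dropping only the alternates API call: 3 requests per search.
    print(projected_requests(ZYTE_REQUESTS_USED, REQUESTS_PER_SEARCH, 3))
    # Dropping both the alternates and the pricing calls: 2 requests per search.
    print(projected_requests(ZYTE_REQUESTS_USED, REQUESTS_PER_SEARCH, 2))
```

Under these assumptions, dropping both calls would roughly halve the request count. Retries after bad responses are not modelled separately here, so the real figure would likely differ.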