Team 1
- Why is this topic interesting?
- What kind of model do you plan on using?
- What is the Brown corpus?
- What is an example of a genre?
- How could you ensure the semantic meaning of the sentence won't change? It seems you would need another loss term to enforce this (see the loss sketch at the end of this list).
- What would accuracy mean in context of the style transfer problem?
- Will the desired sentence style just be a different sentence structure with the same words, or entirely different words? How long does it take to train on the dataset?
- What are your current ideas on meaning representation for evaluating whether sentences are similar to each other?
- Is there a baseline accuracy that you are planning to compare your results against?
- What's the GPT2 decoder for?
- How do you get/define the 14 different genres?
- What's your dataset and what is its size?
- What is the rationale for using a BERT encoder and a GPT2 decoder as opposed to using the same model for both encoding and decoding?
- Can you elaborate on, or give an example of, a different "style"? Does it mean a different syntactic structure?
- Based on your high level diagram it seems the style loss is calculated on the latent representation, is this correct or is the decoder somehow being fine tuned as well?
- I am wondering how you are going to evaluate it. What evaluation metrics are you planning to use?
- How do you evaluate such a system? Are the current evaluation metrics sufficient? What are the weaknesses of these metrics?
- In similar vision settings (where neural transfer is more popular), they use style and content loss to gauge the performance of similar tasks to yours. You mentioned style and possibly style loss, but how do you check for content similarity?
- How do you determine the quality of the generated sentence in a different genre? Genre is quite abstract.
- In your opinion, would combining different modalities for style transfer (with images) potentially improve the quality of the text style transfer?
- What's the expected output of the classifier?
- Is there any previous work related to your topic?
- What is the definition of a style? Can you give some examples of different styles?
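A note on the meaning-preservation questions above: one common formulation is a weighted sum of a content (reconstruction) loss and a style-classification loss. The sketch below is a minimal illustration in PyTorch; the tensor shapes, classifier, and weights are assumptions for exposition, not the team's actual design.

```python
import torch.nn.functional as F

def style_transfer_loss(decoder_logits, source_tokens, style_logits, target_style,
                        content_weight=1.0, style_weight=0.5):
    """Weighted sum of a content (reconstruction) term and a style term.

    decoder_logits: (batch, seq_len, vocab) scores from the decoder (e.g. GPT2)
    source_tokens:  (batch, seq_len) token ids of the input sentence
    style_logits:   (batch, num_genres) scores from a genre/style classifier
    target_style:   (batch,) desired genre labels
    """
    content_loss = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        source_tokens.reshape(-1),
    )
    style_loss = F.cross_entropy(style_logits, target_style)
    return content_weight * content_loss + style_weight * style_loss
```

Tuning `style_weight` trades off how aggressively the genre is changed against how much of the original meaning is preserved.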
---
Team 10
- Why did you use ChatGPT to generate image labels instead of coming up with your own?
- Do you think there will be some overfitting problems with the model? Would it basically reverse-engineer the image-generation captions?
- How did you ensure your data was general enough to properly train the model (i.e. uniformly distributed over the set of caption-types you wanted to generate)?
- What is your main motivation of selecting a generative model such as GPT-3 over traditional image captioning models where the network can be configured?
- Did you have a different approach before ChatGPT was released? What is the reason you decided to go with ChatGPT instead of your previous plan?
- Can you explain in further detail how you generate image using the AI generator?
- How do you automatically collect responses from ChatGPT? I found it sometimes refuses to respond because of too-frequent requests.
- Can you explain further how you choose the best model?
- How do you generate the ground-truth labels for your dataset?
- What is a skip connected fusion block?
- What is the motivation using generated images for caption generation? The objective is not clear.
- What concerns led you to choose ChatGPT to generate your dataset?
- In the related work, does each model in the cross-model setup have a different function from the others, e.g., one for sentiment classification and one for text generation?
- How are your two datasets different?
- How do you compare the difference between the input text and the output text?
- Did you run into any problems generating your dataset from GPT-3 -> Midjourney? Did it ever make an image that didn't line up with the image description?
- What are some benefits and drawbacks in using GPT3 to generate the image captions in comparison to human-labeled captions?
- How do your evaluation metrics work with the results?
- How is your dataset different from others that exist for this / similar tasks?
- Did you use the pretrained weights or did you train it yourself?
- Can you talk about the plan a little more?
- Did you consider other models besides ChatGPT to generate the captions? What made you choose ChatGPT? What about the image generator?
- Do you have performance results yet? Have you found that training on artificially generated images provides good generalization to real images?
- What purpose does this task have? How can it provide value to people?
- What's the reasoning for using chatGPT to generate image captions instead of using captions in other pre-existing datasets
- Why did you choose to generate synthetic image dataset instead of using real world images?
- For the different visual backbone options you are going to try, how many parameters does each model have? How do they compare to the original backbone in the paper?
- How can you make sure all the generated images are of good quality?
- Are the test image encoder backbones also pre-trained?
- Do you think it is valuable to compare the captioning output with the original caption ChatGPT generates? If so, how do you plan to compare them?
- How is your method different from previous approaches?
- How do you choose your image captions to generate your dataset?
- What evaluation metrics would you be using to test your model?
- I'm a bit confused by your training procedure -- are you training your model on the generated images, with the label being the prompt that generated the image? How would the model learn when the image is inaccurate? Where are you providing negative training samples?
---
Team 11
- What is the point of being fully online? How is it better than offline with/without a private data set?
- Why did you choose MuJoCo physics instead of other options, like Nvidia's Isaac Sim?
- How did you generate your dataset? (such as how do you extract it from scratch)
- What is the advantage of fully online DT? Does it perform better, or what is a weakness of fully online DT?
- As someone who is not familiar with the environment in your presentation, what is the particular task that your transformer and RL algorithm are trained to perform?
- What is the performance metric for this task and what are the previous work results?
- What is the task of your decision transformer?
- Where is the algorithm improvement in the latter version?
- Your work is very interesting; I wish I had chosen this. How is your model selecting a new policy? I am still not too sure about the state-space/action-space formulation.
- What is the task you are trying to complete within the MuJoCo simulator?
- Can you provide a few examples of benchmarks that you are considering to use? What are the differences among them?
- Your work looks quite interesting to me! I wonder how you implement the DT part without an exploration policy. You said your version is fully online, so where does the input come from? Does it mean that, without an exploration policy, you simply start from scratch and perhaps keep making random decisions for a long time?
- What are some limitations of the transformer architecture in the online setting for RL?
- What kinds of results do you want to generate? What is the performance of the baseline?
- What was the motivation behind this project and model to be used?
- I was slightly confused by the replay buffer. Basically, you let the model run in the environment, you save the history of that run using the buffer, then you train it based on that run?
---
Team 12
- How will you generate labels for your augmented data?
- What do you mean by proper evidence?
- How do you plan on improving the conflict detector?
- What are the state-of-the-art approaches and their results?
- How do you initialize the loss weights?
- Will reordering change the meaning of the whole context and further affect the accuracy?
- Why does adding repetition increase accuracy?
- When duplicating sentences to produce implausible stories, did you consider changing the sentence a little bit so the model doesn't just find an association between exactly matched sentences and implausible stories? (A toy augmentation sketch follows this list.)
- Have you tested whether the model overfits to the specific scenarios in the dataset after your repetition augmentation, or do you expect that it might? Why or why not? Overfitting seems to be a risk any time you repeatedly sample the same data (oversampling), even though that's not exactly what you're doing.
- Group 12: How did you decide which weights to adjust? There seemed to be 5 hyperparameters to tune, which seems like lots of choices!
- How does adding repetition not "mess up" the states of the example with the system staying in the same state over two time periods?
- In the intro, you mentioned doubt about whether the model truly understands the reasoning process. What standard can be used to judge true understanding of the reasoning by the model? Is it just accuracy? Even if accuracy is high, isn't it possible that the model still doesn't understand the reasoning and just uses some other feature of the text to get high accuracy?
- For data augmentation, will you consider using other augmentations to change the text more aggressively?
- Is there any way the data augmentation techniques could backfire (increasing ability of model to exploit statistical cues)?
- Would you explain in detail how adding a module can address the conflict detector's problem?
- Why is the model used to preprocess the physical state and extract more information helpful?
- What is a possible reason that the improvement is not significant when the physical state is added?
- Why is the verifiability of your method 0%? I am not very sure about the meaning of this metric.
- Why does epoch 5 have the best accuracy with your model? Is there any specific reason? And why was there a huge improvement at step 4?
- What kinds of modules do you plan to experiment with? Can you elaborate?
- How are you changing the loss weights so that the loss is no longer linear? What other ways could you augment the loss function?
- What do you mean by adding a module? What modules are you thinking of trying?
- What extra information is extracted using the added preprocessing module?
- How does the module added in the preprocess capture more information and improve the training?
- For automatic data augmentation, how does the program automatically determine which sentence can be duplicated and which pairs of sentences can be reordered since a plausible story might turn implausible if some sentences are duplicated or reordered?
- How do you avoid overfitting with data augmentation? Are there limits to this approach?
- How do you balance the trade-off between computing resources (time) and accuracy, as well as verifiability?
- An interesting task! Do you have any good methods for tuning the hyperparameters? I see there are 4 loss terms in your loss function, which might make it hard to find the optimal ones.
- What is your primary evaluation metric?
- How did you preprocess the physical state using the new module?
- Do you plan to introduce another dataset to verify your models’ generalizability?
- Why did you use repetition as a data augmentation method? What effect does that have on the story?
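A note on the repetition/reordering questions above: the sketch below shows, in plain Python, the kind of augmentation being asked about (duplicate one sentence, or swap two). It is an illustrative toy, not the team's augmentation code, and does not address which sentences are safe to modify.

```python
import random

def augment_story(sentences, mode="repeat", seed=0):
    """Return an implausible variant of a story by repetition or reordering.

    sentences: list of sentence strings; the original list is not modified.
    """
    rng = random.Random(seed)
    story = list(sentences)
    if mode == "repeat":
        i = rng.randrange(len(story))
        story.insert(i + 1, story[i])                # duplicate one sentence in place
    elif mode == "reorder":
        i, j = rng.sample(range(len(story)), 2)
        story[i], story[j] = story[j], story[i]      # swap two sentences
    return story

print(augment_story(["Ann boiled water.", "Ann made tea.", "Ann drank it."], "repeat"))
```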
---
Team 13
- How are you going to extract sentences that have financial statement for the dataset?
- For price movement prediction, is it predicting a real value (like regression) or a label indicating whether it is increasing or decreasing (like classification)?
- Can your model be used in other fields?
- How do you deal with the unbalanced nature of this problem? I.e., most conversation on social media about stocks does not affect stock prices at all.
- Are there any other features that you think would be helpful besides annotated Reddit posts and Twitter tweets to find financial sentiment and to include in your FinBERT model?
- Could you elaborate on what accuracy means for the related works? Is it just whether or not a stock went up/down?
- What dataset do you use? Do you do any preprocessing?
- Did you annotate the twitter/Reddit data yourself? How is it annotated?
- Does your price movement predictor work better in the short term (day-to-day) or in the long term (over a week for example)? Why?
- Why was naive Bayes used as a baseline for the problem, is there a better model that could be used to be a baseline?
- Can you train sentiment analysis and price movement prediction in an end-to-end fashion?
- It seems that time would be a very important factor affecting stock prices. Will your model take data from different time ranges into account?
- How do you address the limits of the model? Social media can only partially explain stock prices, so there could be an upper limit on accuracy.
- Which happens first, the stock movement or the sentiment of the market? It seems the first causes the second?
- How do you evaluate the model in an open-world setting?
- Can the labels of stock increase or decrease be meaningfully applied to sentiments from platforms like Twitter or Reddit with a strong correlation, or can there be different labels for very similar language because of a weak correlation?
- What is the accuracy of the fine-tuned model?
- Where are you going to use Transformer?
- Do you compare a simple random-prediction model with your model? An accuracy of about 50% is not high enough.
- How would you predict stock price based on positive/negative comments?
- How exactly do you bridge the gap between sentiment analysis of social media data with future stock prediction?
- Have you checked the balance of the dataset? Or tried other binary metrics like MCC? (A toy metric comparison follows this list.)
- What is the baseline model used for this problem?
- What Reddit comments did you use for this analysis? Did you only select the most upvoted comments or did you weigh all of the comments equally?
- What is the intuition for using FinBERT over other pretrained models, and why not train a model from scratch?
- In the future plan section, you mentioned "Scrape Twitter Data". Can you provide more details? What kinds of tweets do you want to scrape?
- How are you going to define the prediction of future price? On the scale of day or month?
- Do we know what level of accuracy similar state-of-the-art models achieve?
- Are there certain categories of stocks that your model does really well or really poorly on?
- Is the model only predicting two options (up/down)? If so, in that case, previous works are only better than random guessing (50%).
- Would you consider adding more features apart from the sentiment from analyzed social media? If sentiment is not very representative of the stock prices, what other data do you think will improve the accuracy of your model?
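A note on the class-balance and MCC questions above: with a skewed up/down test set, accuracy alone can look strong while the model carries no signal. The sketch below (scikit-learn; the labels are made up for illustration) shows how MCC exposes a degenerate "always up" predictor.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Toy labels: 1 = price increased, 0 = price decreased (8 of 10 days go up).
y_true = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
y_pred = [1] * 10                          # degenerate "always predict up" model

print(accuracy_score(y_true, y_pred))      # 0.8  -- looks strong on the skewed set
print(matthews_corrcoef(y_true, y_pred))   # 0.0  -- MCC shows there is no real signal
```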
---
Team 14
- What is the computational overhead of the GNN?
- If some of the words representing the event are replaced by synonyms, will the model keep its performance or will it decrease?
- How are you preprocessing the dataset, and what features are you using?
- Why not try GraphSAGE?
- What's the advantage of using GCN?
- Analysis on why your Co-GCN does better or worse than other state-of-the-art models? Or are the results purely experimental?
- Could you give an example of different graph structures you mentioned in future plan section?
- How do you evaluate the model's performance?
- What graph structures do you plan to experiment with?
- How did you decide on the dataset to use?
- Is this causality identification a very challenging task for deep learning? Would it be possible to use your model to identify causality in other modalities such as image/videos?
- Causal events: what are some of the reasons why one graph outperforms the other two on this dataset (according to F-score)? How are the metrics designed?
- What is the causality here? What is the difference between causality and consequentiality if you use a pre-trained model? Does your model predict the next word, or predict an event that it actually understands?
- Have you considered using a Graph Attention Network?
- Does the model predict the event causality for every token in a sentence? If not, how do you determine the token of interest?
- What is the expected comparative utility between different types of graphs, such as sparse vs. complete graphs, for the GCN layer?
- Sorry that I lost track of your first slide. As far as I see, you are trying to use a GCNN to capture event story. My question is, can GCNN successfully capture sequential information? If not, how is your approach interpretable?
- Do you have other methods to improve the accuracy?
---
Team 15
- How do you design the loss function?
- Why GloVe embeddings, and not other, contextual embeddings?
- How is the object detector module implemented?
- Did you try vision-language models?
- What do you mean by using transformers based on LSTM and RNN? Since there are huge differences between them, what does it mean that you used both?
- What testbed are you using for your work, and what considerations went into the decision to use this testbed? How would the same corpus perform on a different image testbed?
- Do you have ideas for which other datasets you'd like to extend your work to?
- How often does the generated sentence output have correct grammar?
- How will different embedding types and sizes affect the quality of the output?
- Other than prediction accuracy, how about the time/memory performance of all your models?
- Do you plan on using other state-of-the-art models such as BERTs? Would they improve your performance?
- What differences do you expect to see when you are using different datasets? Do you have an idea of what datasets you will try and what you predict will happen?
- Does a single-stage detector (FCOS) suit this task better than a two-stage detector, since the visual features are less noisy (it does not use anchor boxes for training) when fed downstream to the transformer-based models?
- Do you have any guesses as to why the model performed better when identifying certain things (like horses) and worse for other things?
- Have you noticed any differences in the quality of the captions generated with B/W images compared to RGB images?
- How does an object detection method improve the caption generation? In other words, what effect does localizing objects in the image have on the caption generation?
Another question: by using the localized object information, would you lose the global context of the image?
- How do you plan to implement the evaluation? Not sure if text data is provided in the dataset.
---
Team 16
- How do you plan on comparing the GPT-3 summaries with your summaries?
- Are the current summarization metrics sufficient to evaluate the quality of the system?
- Can you elaborate more on the criteria for the clustering method? How were the distances between sentences calculated?
- Are you just using the standard SVD? Or possibly exploring the performance of randomized SVD as well (especially if the data are very large)?
- Why is ROUGE well suited to your dataset?
- Is the cosine distance sufficient for clustering? Are there any good metrics that can improve the performance of models?
- Which part of your project is unique? (Have you tried modifying previous works or combining previous works?)
- From what perspective do you intend to further analyze the results?
- Will the summarization be effective when there is a very long sentence that is half important and half nonsense?
- Is there any special reason you use RoBERTa-base rather than RoBERTa-large or other models?
- Why did you use RoBERTa to train the model as opposed to another pretrained transformer? What are the benefits?
- What are the pros and cons of methods 1 and 3? (The slides only mentioned pros and cons for method 2.)
- What did it mean when you said method tf-idf was sensitive to stopwords and lemmatization? Wouldn't tf-idf be resistant to stop words? And did you try it with and without these to see if they affect results (or do you have those results)?
- You mentioned ROUGE is not good. Why is it not good? In what way might it be improved?
- Why did RoBERTa perform worse on ROUGE-2 than on ROUGE-1 and ROUGE-L?
- Did you fine-tune your transformer? Also, texts in the CNN/DailyMail data are relatively long; how did you choose your input and output lengths?
- For three metrics you picked (ROUGE-1, ROUGE-2, ROUGE-L), what are the differences among three of them? Do you think that those three metrics can provide enough insights about the performance of your methods?
- Is there any way to evaluate whether or not the summary is factually correct? Is that captured in one of the three metrics: precision, recall, and performance?
- What is the objective function of text summarization tasks?
- What in-depth analysis do you plan to perform? I see that there are pros and cons for different methods, and each method is suitable for different situations. Do you plan to compare the performance of the model on different kinds of datasets, such as large and small ones?
- How do you calculate the evaluation metrics for a summarization task without specific labels?
- Could you clarify how ROUGE-L corresponds to the fluency of the summary?
- As you have said, good summaries should be semantically similar as a whole, while the ROUGE scores seem to focus on measuring overlap at a lexical level. Do you think ROUGE scores are a good way to evaluate how semantically similar two summaries are? (A toy ROUGE-1 sketch follows this list.)
- Do you plan to implement any abstractive summarization approaches?
- Why is this topic interesting?
- What is the purpose of the project?
- What are shortcomings of previous work models on text summarization? Does your model improve on any of these shortcomings?
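A note on the ROUGE questions above: ROUGE-1/ROUGE-2 count unigram/bigram overlap between the candidate and reference summaries, and ROUGE-L uses the longest common subsequence. The toy sketch below computes ROUGE-1 precision/recall/F1 only (no stemming or stopword handling), to make concrete why lexical overlap can miss semantic similarity.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Toy ROUGE-1: unigram-overlap precision, recall, and F1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
```

Two summaries with the same meaning but different wording would score poorly here, which is exactly the weakness several of the questions point at.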
---
Team 17
- How do you improve incoherence within sentences?
- Why could sentiment analysis be effective in predicting stock price? I mean, besides the result, how does this method make sense theoretically?
- How do you prevent overfitting in real-world examples with data augmentation?
- Why did you choose the ALBERT model over other models?
- What are the data augmentation methods you consider implementing?
- What are your proposed approaches?
- Can you elaborate on what transformation techniques do you plan to use?
- Could you get into why your model choice improves coherence over other baselines and what has been previously used in the field?
- What kind of data deformations are you planning to try?
- What specific techniques do you have in mind for data augmentation?
---
Team 18
- What is the motivation for training and comparing LSTM results in addition to BERT?
- How are the emojis encoded in the model? Is it the same as for images?
- How are the word embeddings mapped to the emoji representation?
- Is it possible that emoticons such as :} are just used as text or code rather than as emojis?
- Did you also consider using RoBERTa and DeBERTa models instead of just BERT?
- As a baseline, what if you just counted how many positive/negative emojis are in the sentence? What would the performance be? (A toy version of this baseline is sketched after this list.)
- From your observation, have you noticed any emojis with specific characteristics causing issues with the model?
- If you divide into 2 classes, where should the model place emojis with neutral emotions?
- Are there emojis that are used in completely different contexts between cultures? To what degree might your model account for this? Might your training be biased towards certain cultures and thus certain emoji use cases in any way?
- Since emoji usage evolves over time (i.e. crying emoji is now often used ironically), do you think using a dataset collected over such a long period of time will affect the results of the sentiment analysis?
- Have you considered emoji decomposition for emoji representation learning?
- For the LSTM, how do the word embeddings affect the network accuracy? Have you tried adjusting the c-cell weights?
- Is there a one-to-one mapping between iPhone and Android emojis or did you have to account for that in your project? Or was that already taken care of in the dataset you used?
- Does the full dataset in the plan slide refer to all 3.8 TB of tweets?
- Have you considered predicting the positivity/negativity of an emoji as an image regression problem? Then you could use that value as an additional parameter for sentiment analysis.
- Will emojis have an effect on sarcasm classification and other tasks?
- What made you guys choose LSTM and BERT over other types of models?
- What was the main motivation of the project?
- Interesting direction! My question is that is it possible to utilize the visual features of emoji? Cause they reflect the sentiment information in my understanding.
- How do you do data cleaning for tweets?
- How do you clean the data retrieved from Twitter?
- Would emojis influence more on models' performance than text?
- Group 18: Among all the samples, what is the proportion of text content accompanied by emojis? In other words, how often do people express themselves with emojis?
- Is there any way to reduce the full 3.8 TB dataset to a much smaller, more manageable representative dataset on which you can run and compare the baseline models and your model? Running on the full 3.8 TB dataset seems infeasible.
- How are you going to deal with emojis that don't have a clear emotional definition?
- How do you think your model will fare against sarcastic user inputs, where the emojis used don't match the expressed sentiment?
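A note on the emoji-counting baseline asked about above: a minimal sketch of that trivial baseline, with a small hand-picked (assumed) emoji polarity lexicon.

```python
POSITIVE = {"😂", "😍", "👍", "😊"}
NEGATIVE = {"😢", "😡", "👎", "😠"}

def emoji_count_baseline(text):
    """Label text by whichever emoji polarity occurs more often."""
    pos = sum(ch in POSITIVE for ch in text)
    neg = sum(ch in NEGATIVE for ch in text)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"

print(emoji_count_baseline("great game tonight 😍👍"))   # positive
```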
---
Team 19
- What does the Data Cleaning layer do?
- Where did you get your data?
- How did you deal with emojis in your dataset?
- Your explanation layer output the words that explains the prediction. How do you evaluate such explanation?
- In the method architecture, does the Twitter API make any contribution to the classification? And what does the Sentiment Analysis layer do: does it perform sentiment classification, or just convert text into word vectors? Also, how did you clean the data (which information is removed)?
- How did you deal with special characters such as emojis and #hashtags in your data?
- Why do you think an LSTM will not achieve accuracy comparable to transformers?
- What's the baseline you are comparing with? Do you obtain any improvements over the baseline?
- On the first results visualization slide, it looks like the neutral words don't show up on the visualization. Is that an intentional choice, or is that just because both the background and the font for that word is white and blends into the slide?
- Why "fan" is a negative word in tweeter? lol
- How do you calculate shap value?
- Did you find any unexpected words that hold a lot of positive or negative weight in the sentiment classification?
- Did you find any words that seemed to have abnormal sentiments associated with them?
- What other models are you planning on using to compare to your current models? Is there any reason you chose these models?
- Did you observe any interesting results on mixed sentiment tweets as well?
- How does the model explanation layer work? Do you evaluate the keyword analysis it gives?
- Are there previous works that use deep learning on this task?
- Much of Twitter is satirical, does the data take that in mind?
- How do you design the data clean layer?
- Will your model predict the same sentiment for 'interested' and 'not interested'?
- How do you evaluate model bias on a racial or geo-location basis when classifying a tweet as positive or negative? It would be interesting to look at those statistics.
- Does the length of the tweet have any effect on the sentiment of the tweet?
- Since your method is based on specific keywords, how do you think your model would fare against sarcastic tweets, where positive words may be used to signify a negative sentiment?
---
Team 2
- What exactly is OFA (One For All) model?
- How do you choose the most appropriate dataset?
- How do you think this project will impact the industry?
- Any way to augment caption dataset with synonyms/antonyms to reduce error?
- What dataset are you using?
---
Team 20
- Is this a word prediction task or sentiment classification task?
- What's the motivation for GNN at the end? Is there a specific NLP approach that uses them?
- Are there any previous work (model or approach proposed) done in this area?
- Are there emojis that are used in completely different contexts between cultures (and thus between languages)? To what degree might your model account for this? Might there be any bias in your training data towards certain cultures within a given language? What are some ways one might try to account for this training bias?
- Do you think emoji prediction could be improved by combining it with other methods such as sarcasm detection or sentiment analysis?
- Did you consider trying RoBERTa and DeBERTa models instead of just BERT?
- What if a single tweet contains multiple different feelings?
- There are a lot of abbreviations on twitter. How did you handle these?
- You transform URLs into a special token, but some URLs in tweets do carry meaning; how do you make sure this will not affect your accuracy? (A toy preprocessing sketch follows this list.)
- What corpus of emoticons was used to tokenize the emoticons?
- How did you select these 20 emojis to analyze?
- Group 20: What advantage does the inclusion of the dropout layer provide?
- Does the complexity of the model guarantee better accuracy? What is the reason for performance difference between TextCNN and BERT?
- Why do you think Text CNN outperformed the Bi-LSTM models?
- Have you tried more advanced Pretrained Language Models such as RoBERTa/ALBERT? Do you find they can yield better performance?
- Are there previous works that use deep learning on this task?
- Have you looked at other metrics to evaluate your models (like F1 score)?
- How do you transform the emoticons?
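A note on the URL/mention preprocessing questions above: a minimal, regex-based tweet-normalization sketch. The pipeline and placeholder token names are illustrative assumptions, not the team's actual code.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")

def normalize_tweet(text):
    """Replace URLs and @-mentions with placeholder tokens before tokenization."""
    text = URL_RE.sub("<url>", text)
    text = USER_RE.sub("<user>", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_tweet("@friend check this out https://t.co/abc123 so cool"))
# -> "<user> check this out <url> so cool"
```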
---
Team 21
- Is ROUGE evaluating both extractive and abstractive summaries? And could you talk a little more about the ROUGE evaluation method?
- How do you deal with more nuanced user opinions? (i.e., they liked feature X of a product but not feature Y)
- How do these approaches avoid gravitating toward a small number of very similar fake reviews?
- Is Rouge a good metrics to use, why or why not?
- I was slightly confused about the dataset that you used. Was it from Amazon? Or a combination of multiple datasets?
- How are you planning to determine positive and negative reviews? Is this done through sentiment analysis? If so, what models do you use for sentiment analysis?
- What exactly is meant by extractive and abstractive models?
- How are the extractive part and the abstractive part combined?
- where did you get the customer review dataset?
- How do you choose your evaluation metric?
- Why do the results of the pros summary dominate the cons? Thanks.
- What is the difference between abstractive and extractive? Which is better for your model?
- Different products have distinct features reflecting their pros and cons. For example, (fruits, fresh), (table, stable). These features cannot be unified. How do you address this issue in your model?
- I really like your work. I am just wondering how you were able to sample the data. Did you obtain consent from students from previous years? Have you considered classifiers other than BERT?
- Can you explain more why you want to use extractive methods? In addition to ROUGE or performance metrics, do you notice any other differences between methods?
- I see the ROUGE scores in the presentation; most of them are in the range 0 to 0.3. I am curious how you interpret this range.
- Could this work be applied to detect fake or bought reviews?
- How do you preprocess the data
- Extracting sentiment from reviews with summarization is a good idea, but don't the star ratings from reviewers already indicate sentiment?
- What effect, if any, do you think removing neutral reviews would have on customers' perception of the product?
---
Team 22
- What preprocessing method did you choose?
- What factors does the max voting consider for the vote? Is it just random voting? (A toy majority-vote sketch follows this list.)
- What are the applications of this work?
- Will you be able to tell which method will provide you with the most accuracy increase?
- Which approach of the 4 you mentioned do you think should give the best performance and why?
- Do you have any expectations for which of the four ensemble methods will be more effective?
- How does performance change with variants of the same model type under different initial training conditions? Is there a difference between an ensemble of 3 and an ensemble of 5 BERT models? Are model ensembles implemented strictly with heterogeneous model structures, or also homogeneous ones?
- Is there any particular reason that you choose these three models? What are the pros and cons of different ensemble methods?
- How will you evaluate the results once you have them for the ensemble methods?
- how did you decide which model to experiment with in order to potentially improve the performance? What are some obstacles in reproducing the state-of-the-art model?
- What are the reasons you chose each method for this project?
- Why BERT, RoBERTa, DeBERTa as base models? Would a non-BERT transformer give other helpful insights?
- How do the ensemble methods compensate for each other to get a better result? What's the intuition for using ensemble methods, besides higher metrics?
- Why did you decide to try out ensemble learning strategies?
- What's the difference among those three base models?
- What do you hypothesize will be the best ensemble based method?
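A note on the max-voting questions above: hard majority voting is not random; each model casts one vote per example and the most common label wins. A toy sketch with three classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    """Hard-voting ensemble. predictions: list of per-model label lists."""
    voted = []
    for labels in zip(*predictions):               # one tuple of votes per example
        votes = Counter(labels)
        voted.append(votes.most_common(1)[0][0])   # ties fall to the first-seen label
    return voted

bert, roberta, deberta = [1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1]
print(majority_vote([bert, roberta, deberta]))     # [1, 0, 1, 1]
```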
---
Team 23
- Have you noticed any biases in the model?
- How exactly do you plan to combine the three approaches listed?
- How do you evaluate "connectedness" in the generated responses?
- Why do longer sentences tend to be more illogical?
- Does limiting the length totally solve the logic problem?
- Have you thought of weighting the answers instead of deleting them? Maybe you lose some information?
- How do you design qualitative metrics to evaluate the different aspects you try to capture in your responses?
- How do you think your results compare to the recent ChatGPT?
- This is a really interesting topic! What impact do you think this project can have on mental health/psychology? Could this lead to better automated help without the need of humans?
- Do you have preliminary results now and which method is better?
- How is the GPT-2 model going to be fine-tuned? Does it add more parameters?
- Do you think your system will be robust enough to not give too many false positives and false negatives for real clinical use (which may be out-of-distribution)?
- Why using masking in BERT can better understand sentiments?
- Is your model based on Chinese or English?
- How do you estimate the accuracy or 'goodness' of output?
- How does the RoBERTa model help your question-answering performance?
- How do you apply small-granularity tokenization to text understanding, and what is the intuition behind doing so?
---
Team 24
- Group 24: How will you combine the 3 approaches together?
- What preprocessing methods are used?
- What stronger models are you considering?
- Have you noticed any differences in the generated outputs with respect to the hyperparameters?
- What exactly do you use for measuring lexical complexity as a constraint?
- How do you evaluate "diversity" in generation?
- Group 24: How do you evaluate the content you generate? Do you think they are good for dictionary example sentences? What about your evaluation standards?
- Why is BART used as the baseline? Is it because of its seq2seq architecture?
- How do you plan to evaluate how good the beam search method is compared with a normal greedy search?
- How is diverse beam search implemented? Kindly share some details. Thanks!
---
Team 25
- What was the reason to use those other datasets in the first place? Do they have the same format / context as TRIP? If so, why do you think performance decreased?
- Is there a method that you can come up with in order to deal with inaccurate translation for long sentences?
- How do you deal with the dataset format problem when using other datasets?
- How do you choose the new dataset?
- Which pretrained models do you plan to use in the future and why did you choose them?
- Why do you think switching among the pre-trained models doesn't make much difference?
- What are the (possible) reasons that the pretrained models do not outperform the baseline?
- What are each of the different variants used in gradient surgery, and how effective is gradient surgery?
- In your presentation, on slide 8, you differentiate between low- and high-level tasks. What is the difference between these two types of tasks?
- What other pretrained models are you going to experiment with, and why did you choose them?
- What are the challenges in this research?
- How was the similarity of the chosen dataset (SQuAD 2.0) to TRIP computed/evaluated? Is anything lost by running the authors' model on a dataset other than TRIP, which it was targeted at?
- What would the newly formulated objective functions and learning rate look like?
---
Team 26
- What preprocessing did you choose?
- Do you use any tricks in data preparation?
- Do you think the current evaluation metrics are adequate? Why or why not?
- What exactly is the intuition behind how the X-Linear attention block boosts performance?
- What is a parallel block?
- How much does the more complicated attention block design affect runtime?
- Why do you select Coco as your dataset?
- What is the task that you would like to test your new attention heads on?
- Could you give some examples of the results?
- What tools (or assumptions/theories) would you use to explain the difference in performance from these changes?
- How do you build a multilingual model? Would one BERT model be good enough for a multilingual model?
- Since you are using the same hyperparameters for your models, how do you plan on explaining why certain methods/functions/blocks work better than others?
- I don't understand the diagram on slide 4. Isn't "softmax + weighted sum" an O(n^2) operation, not linear?
- How would you compare your results if you end up training a smaller model for less time? In other words, how would you know whether the results are due to the model being undertrained or due to some error in the model?
---
Team 27
- What is your inspiration of using this proposed method?
- Sarcasm can be pretty subjective. Just wondering how humans will perform on these tasks?
- Do you think the differences among Arabic, Hindi, and English would affect the results?
- How are you modifying your architecture for the proposed model?
- Do you have any data imbalance between sarcastic and nonsarcastic tweets? If so, have you worked around the potential issues that might arise from that?
- Did you manually label the sarcasm/not sarcasm for every tweet in the Hindi dataset?
- How did you get the Hindi dataset?
- How was the labelling done for the newly prepared Hindi dataset?
- How did you modify the architecture? Why do you think this modified algorithm works better than baseline BERT?
- What is the difference between your model and the model from the paper?
- Can this methodology scale to multilingual datasets, given that sarcasm is subjective and varies from person to person? And can the LSTM-based model be replaced with an attention-based transformer model?
- How do you deal with the imbalance problem of these two labels?
- How do you define sarcasm?
- How did the model perform for the different languages and do you have any ideas for why the results were different (if they were)?
- How do you build a multilingual model? Are you using one BERT model to perform the multilingual task?
- What is the main difference in general text classification models and sarcasm detection models? Are there any properties tailored to cope with this specific purpose?
- Are there any different feature-engineering techniques you apply for different languages? In other words, does the language type play a role in your model? For example, for English data you may use technique A, but for Hindi data you may use technique B.
- In your results, it looked like your F1 scores were lower than the accuracy. Have you looked into why this might be (imbalanced classes / predictions)?
- I'm curious about the degree to which sarcasm learned in one dataset, such as tweets, can be transferred to a different test bed, such as Shakespearean literature, political speeches, and so on. I suspect that, as a result of the culture of Twitter, the character limit on tweets, along with other factors, may alter the way sarcasm presents in your test bed versus other possible instances of sarcasm. Even if your model displays low transfer learning to other datasets like those I mentioned, I think this would still provide a very interesting result on the different ways people present their sarcasm in response to setting, audience, and format.
- Did you use different models for the different language datasets or could the same model be used? Did you observe any language which seemed to perform better than the others?
- Any greater significance of evaluating on these specific different languages (English, Arabic, Hindi)? Any analysis on the key syntactic differences among these languages and how/why it makes the trained model(s) more robust?
- As this project takes in data from different languages, what pre-processing methods equivalent to stemming and tokenisation are being used, if any?
- May I ask what's your data preprocessing on your Hindi dataset?
---
Team 28
- Why did you choose to use the BERT classifier? (I like this project for the way it is grounded in a real useful application!)
- Did you annotate the dataset yourselves? How do you define what is considered urgent? It can be subjective. Did you check agreement between two annotators on a subset of the data?
- Are there any privacy issues while using piazza posts? How do you define urgent?
- Did you label the data yourselves?
- What will your other NN structure be?
- How do you deal with the ethical concerns of data collection? Have you confirmed with students from previous EECS 370 semesters whether you can use their data?
- How did you scrape data from the last 3 semesters of EECS 370?
- How do you fuse information for metadata? (What is the mathematical formula for the fusion?)
- How did you do the annotation? If it's all purely manual, it's really a lot of work.
- I wonder how you labeled these collected 12,000 posts.
- Do you manually label each post as urgent or not? If so, how do you label them? I think if you label a post as urgent according to keywords like "illness", your model might just learn something very simple: if it detects those keywords, output urgent.
- How do you incorporate the Slack reactions into your model?
- Which type of sampling was the most useful for your results? Over or undersampling?
- I'm curious -- is it possible and feasible for a student with access to the notification bot to "hijack" the model and negligibly alter their Piazza posts so that the bot always labels them as urgent and they always get a prompt response from course staff?
- To what degree do you think that transfer learning between courses would be successful? For example, would a model trained on EECS 370 exhibit similar accuracy if tested on EECS 280?
- I want to know more about the dataset. What is the approach to manually label such a large dataset? To what extent is the dataset imbalanced?
- I was really curious about why your undersampled model has the highest recall.
- You mentioned that the dataset consists of piazza posts from the last 3 semesters. Do you think you should obtain consent from the students before scraping the posts for the research purpose?
- First of all, I'm really curious why your undersampled model has the highest recall. I'm also wondering about your intention in creating original, oversampled, and undersampled datasets. Why create these three datasets, and can you briefly talk about their definition or format?
- Are you concerned at all about the potential of biases that may occur from manual annotation? Have you thought about possible ways to make your annotations more fair?
- What is the advantage of creating oversampled and undersampled datasets? (A toy resampling sketch follows this list.)
- I think false negatives are more critical than false positives for this specific task. When evaluating, are you considering giving more weight to avoiding false negatives? You don't want to miss a post that needs immediate assistance (a false negative); it's okay if a post that doesn't need critical assistance gets mislabelled as needing it (a false positive).
- Could you provide the amount of imbalance in the dataset?
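A note on the oversampled/undersampled dataset questions above: a minimal sketch of random resampling for an imbalanced binary label. It is illustrative only; the team's actual splits and ratios are not shown here.

```python
import random

def resample(examples, labels, mode="over", seed=0):
    """Random over- or undersampling for a binary-labeled dataset.

    examples, labels: parallel lists; labels are 0 (not urgent) / 1 (urgent).
    Returns a shuffled list of (example, label) pairs with balanced classes.
    """
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    sizes = [len(v) for v in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out = []
    for y, xs in by_class.items():
        if len(xs) >= target:
            sampled = rng.sample(xs, target)                      # trim the majority class
        else:
            sampled = xs + rng.choices(xs, k=target - len(xs))    # repeat minority examples
        out.extend((x, y) for x in sampled)
    rng.shuffle(out)
    return out
```

Oversampling keeps every original post but repeats minority (urgent) ones, while undersampling discards majority posts, which is one reason the two variants can shift precision and recall differently.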
---
Team 29
- Is there other related research on this topic?
- Do you plan on using other BERT based models?
- How did you preprocess your dataset?
- In what ways could the bias of the model be mitigated to yield more reasonable results?
- Do you think your approach can be applied to other pretrained models?
- What is the consistency and verifiability of your model?
- How were you able to reach those three inferences, e.g. the model is making unnecessary additional assumptions?
- What is your motivation of using the prompted guided dialog method? Does this method help you with your research question? If yes, how?
---
Team 3
- Group 2: Do you manually give captions to the images?
- Group 2: How will you handle errors in the captions themselves in the future?
- Group 2: Did you experiment with synonym detection? For example, if you get "sliced onion" from the OFA model, you could match this with "cut", "chop", etc. to fix some of the errors from mismatched verbs/nouns.
- Group 2: Do you currently have any proposed idea to solve the evaluation error caused by verb synonyms?
- Group 2: The slides mention that the best verb-noun pair is selected from the image captions; I'm wondering how the "best" verb is determined and selected.
- Group 2: Is there any way you can ensure synonyms don't result in an error? Perhaps comparing word embeddings to calculate your metric instead of a black-and-white accuracy (see the sketch at the end of this list).
- Group 2: Do you plan any solution for the synonyms, like computing similarities?
- Group 2: Is it possible to use something like cosine similarity to help address the captioning error (e.g., "cut" is similar to "chop"), or to use a dictionary of synonyms?
- Group 2: What is your motivation to fine-tune the OFA model if the baseline can yield the accuracy needed? Group 3: How does the model extract the action from the ASR?
- Group 3: What is your dataset and what is its size?
- How do you fine-tune the parameters?
- Group 2: Why did you think word embeddings would provide words about causality? (I like your setup!)
- How did you fine tune you models?
- Group 3: How are these datasets generated? Have you explored a reinforcement learning approach to avoid long dataset annotation?
- Does the recipe recognition capture the order of the process, such as washing the orange before slicing it? And what happens if the order of the recipe input data provided to the network is changed?
- What does the mistake detection represent?
- Were there other approaches you considered? What made you decide on the chosen approaches?
- What do you predict will happen if you only train on a subset of the CoCa data?
- In your dataset, permuting steps to generate mistake examples may make the problem overly simple, since the result may be obviously logically incoherent. How do you ensure the dataset maintains the right level of complexity?
- Group 3: How are you deciding what subset of the CoCa dataset you are using?
- What is the captioning loss used by CoCa?
- Are there distilled pretrained versions of CoCa that you can use to save time of having to pretrain CoCa yourself?
- Group 3: What are the tradeoffs/advantages when comparing the two approaches that were proposed?
- Group 2: What exactly is the OFA (One For All) model? Is it a neural network?
- Group 3: How did you match the text description with the video frames?
- Do you have any plans on sampling the dataset to maximize performance? Or just random sampling?
- How much training time on the whole dataset did the previous work take, if you have any idea?
- How large is your dataset? This seems to be a task that needs a lot of resources.
- I see that the metric you're using is accuracy. What do you mean by accuracy here? I'm a little confused because your data set is very complex and there are many things to examine.
- Can you explain how your two approaches differ? Why did you choose these two methods?
- Group 2: What is your hypothesis about the results of the plan to train on a subset of the data to address the accuracy-speed tradeoff?
- Do you think that CoCa is going to give a better accuracy than distant supervision? How will you balance the tradeoffs within CoCa?
- For CoCa model, can you give more insights about the result? In other words, how are you going to evaluate your model? What metrics are you going to use?
- What are some real life applications for this research?
- Is CoCa trained on sequences of images? You showed the downstream task of video action recognition where each image is sent through a pooled attention mechanism, so I'm wondering if it's trained on sequences of images, or on individual images, or both?
- How did your approach differ from previous work?
- How many frames from each video are used? Are they chosen randomly or are they manually selected to correspond to specific steps?
- How will you test your hypothesis with evaluation metrics?
- How long does it take to train the model?
- Group 2: There seems to be a mismatch between the generated captions and the labels. Do you think an alternative method to induce the generation of action-focused captions would be more effective, and would that be feasible?
- For input, do you use multiple frames from one video or do you select one frame to represent the information from the video?
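A note on the synonym/cosine-similarity questions for Group 2 above: a toy sketch of giving credit for embedding-similar verbs instead of exact matches. The vectors below are made up for illustration; in practice, pre-trained embeddings such as GloVe would be loaded.

```python
import numpy as np

# Hypothetical word vectors; real systems would load pre-trained embeddings.
VECTORS = {
    "cut":   np.array([0.90, 0.10, 0.30]),
    "chop":  np.array([0.85, 0.15, 0.35]),
    "slice": np.array([0.80, 0.20, 0.40]),
    "boil":  np.array([0.10, 0.90, 0.20]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def verbs_match(predicted, gold, threshold=0.9):
    """Count a predicted verb as correct if it is embedding-similar to the gold verb."""
    return cosine(VECTORS[predicted], VECTORS[gold]) >= threshold

print(verbs_match("chop", "cut"))   # True  -- synonym gets credit
print(verbs_match("boil", "cut"))   # False
```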
---
Team 30
- What training preprocessing method will you choose?
- What tools did you use for data augmentation?
- Why would adding that context help? Does the baseline also implicitly keep track of the context?
- Why do you think adding context awareness decreases the verifiability?
- Why do you only augment the training data?
- How do you integrate the context into the input sentence?
---
Team 31
- How many types of syntactic structures did you adopt for the chain-of-thought design?
- Have you played with ChatGPT? What do you think of its math ability? How can you evaluate the reasoning procedure in an automatic way?
- Is this project mainly about reproducing chain of thought? Anything new you would like to try (for better or worse) which is not explored in the original paper?
- How do you plan to fine tune the model?
- How do you measure the distance between the prediction and the true answer? For example, if the answer is 30, a model answering 29 is not necessarily better than a model answering 20.
- Are there any limitations to chain of thought reasoning? Is chain of thought reasoning applicable to other language tasks such as text summarization?
- Based on your research on this problem, what is the biggest obstacle preventing LLMs from solving more advanced math problems?
- Interesting work! Do you find any similarity between missing concepts? Thanks!
- Why is this topic interesting?
---
Team 32
- Would you consider fine-tuning other transformer models (e.g., GPT)?
- How do you plan to resolve the bias issue?
- Why is softmax not a good indication of performance for the binary model?
- How do you address skewed data for training?
- Why do you think such imbalance toward "idk" happen in your model? What do you think could improve it?
- Did the paper provide justification for why BERT based models performed so poorly with the 3rd label as unknown?
- Is there a better way to account for imbalance in the dataset?
- What are you planning to do with the random forest and GPT models? What is their merit compared with the model you already built?
- What is your specific training strategy for combined models like BERT+GBDT?
- Is there perhaps a way to "normalize" out the bias from imbalanced dataset (bias towards "IDK")? Is it feasible to look at another Yes/No/IDK dataset that is more balanced, if there exists any?
- Since there is a class imbalance, will you still be using accuracy for your metric?
- Are there any criteria for selecting a calibrator other than scores? (I think just trying all possible calibrators is not doable.)
- What if we do not treat "idk" as another label? Instead, could we design a loss function with zero loss when the prediction is around 50/50 on yes/no and the answer is "idk"?
---
Team 33
- What was the motivation behind changing the input format?
- Is there a specific type of structured encoder that you will be using? Will it also be treated as a factor on which the model accuracy can depend?
- What are the baseline models you plan to compare your own results with?
---
Team 34
- What metrics did you choose for evaluation? Why those metrics?
- What would the results look like using off-policy algorithms instead?
- What models and communication format do you use between agents? Is it all just done in natural language?
- Does this require the robots to be in constant communication so they know the tasks each other are doing?
- Could you please explain a bit more about the ad-wizard design? How are you going to design it?
- How do you optimize the model, i.e., which objective function do you use? Is this model similar to an RL model?
- Do the agents in your simulation share all information about their observations, or are they only sharing some information?
- Your project is really interesting. For the new tasks you mentioned in the future-work section, can you give a bit more insight into one or two of them? What difficulties are you expecting to see? Thank you.
- Why do we need communication instead of just having a "big brother" that knows everything?
- What RL models are you using for agents? How is the environment setup?
- May I ask where the NLP part is in your presentation? There seems to be more robotics than NLP.
---
Team 35
- How do you evaluate your results? Is repetitiveness harmful to model performance according to the BLEU score?
- Would using decision trees for feature engineering be more robust in this case?
- How does deBerta compare to T5?
- The improvements are not substantial. What do you think about these results?
- What are some potential reasons that different models are good at predicting different aspects of an essay score?
- What is phraseology in terms of text evaluation, and how is that evaluated for this task?
- I am curious about the state-of-the-art model for AES. could you please add some more related previous work?
- What is your motivation to add a handcrafted model to an existing model?
- What datasets did you use or plan to use to pretrain or fine-tune the models? Is the dataset large enough to ensure no overfitting takes place unintentionally?
- What makes scoring the performance of English learners more challenging than that of native English speakers?
---
Team 36
- How are you planning to solve the problem related to using spurious intermediate evidence?
- Is there a specific reason for choosing AlBERT and XLNet among so many language models?
- What do you expect will happen after you train on the entire dataset, do you think you will perform drastically better than 53%?
- Is running the LLMs on small batches intentional to give an idea of the accuracy trend?
- How does XLNet help improve coherence?
- Are you going to design any extra modules? It seems that simply applying XLNet or ALBERT does not necessarily improve the performance.
- Do you think the two models you are experimenting can avoid having the issue of relying on spurious intermediate evidence?
- Why only ALBERT and XLNet? Why not other models?
- What were the main difficulties in reproducing existing results, and main challenges in modifying existing code to use your two models?
---
Team 37
- What's your definition of semantic plausibility?
- Are you planning to use a bigger dataset for experiment 1, since the one you mentioned had only 32 sets/data points?
- How are you testing individual attention heads?
- How do you decide whether an attention head is processing plausibility?
---
Team 38
- Is the concept of knowledgeable teacher similar to imitation learning in the sub-problem?
- How does this work differ from recent work on "emergent communication"?
- Are there biases in the system, and if so, what are they?
- Have you collected any data so far?
- What is the timeline? Do you plan to have results in the final paper?
- To what degree do you predict that your architecture might exhibit transfer learning between fake languages constructed from real-world languages other than English, such as Mandarin? Will your model closely fit to the underlying structural idiosyncrasies of the English language as they contribute to the construction of your fake language, or are the rules learned about this fake language by the model more general?
- The project itself is super interesting. However, I don't fully understand what learning an artificial language means here. Do you have an initial language at the start, or do you begin with nothing? Are you learning "English" as the artificial language, or an AI-generated (possibly noisy) language?
- Your work is quite impressive! I wonder how you generate these fake words. Are they substitutions based on simple rules, or does the grammar change as well? I also wonder why you focus on "cold start". Is it because you are using fake English and therefore have no language model for it? If so, how much human data is assumed to be required for this work?
- If the artificial language that the agent learns is not human-understandable, how would we be able to evaluate it?
---
Team 39
- What's the advantage of your model in comparison with previous models of visual question answering?
- How do you select which patches of the image to use?
- Is there no zero-shot VQA in the literature so far? Which baselines do you plan to compare against?
---
Team 4
- When looking at the 59% statistic, is the number of views/interactions a tweet has taken into account? In other words, if a URL is not opened because nobody saw the tweet, is that still counted as part of the 59%?
- How do you measure virality?
- Why do you think headlines will improve the accuracy of the model?
- How are the virality scores of the tweets annotated? Do view counts matter for virality? What are the objective factors behind a higher virality score?
- How would you tune the parameters?
- Do you have any baseline or previous work to compare your 76% result with?
- Why are two titles being used for the input? Is the model supposed to predict which article between the 2 inputs will go viral, or is it predicting whether or not a given title will be viral?
- What is your prediction on what will happen when you do append the headline onto the title when inputting into your model?
- How did you arrive at the linear-layer dimensions? Experimentation, or are there theoretical reasons?
- Why did you choose to use only one dropout layer and one linear layer? Did you experiment with different architectures? (A sketch of this kind of classification head follows this list.)
- What are your classification labels? Is it binary classification or are there different thresholds of virality?
- Regarding the prior work: do you think ViralBERT would be much less accurate if it only had the tweet's text?
- How did you label the dataset? Any cross annotator agreement since the task appears subjective?
- How is virality defined (i.e. how you do determine which articles "went viral" vs ones that didn't)?
- As you mentioned that "sentiment" is a crucial component in predicting how viral a tweet or piece of media will become, have you noticed any differences in the type of sentiment being expressed? Does one have a strong inclination toward going viral?
- In the future plan, when you mentioned including the headline, could it be possible that just the headline is enough to find out if it goes viral or not?
- How do you tune the parameters?
- In your proposed method, what do the example title and second title represent for input?
- Can you explain in more detail where you have improved your work compared to the previous work?
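For reference, a minimal sketch of the kind of single-dropout, single-linear-layer classification head several questions above refer to. The dropout rate, label count, and example tweet are assumptions, not the team's actual configuration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ViralityClassifier(nn.Module):
    """Hypothetical BERT encoder + one dropout layer + one linear layer."""
    def __init__(self, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        # hidden_size is 768 for bert-base; num_labels=2 assumes binary virality.
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] representation
        return self.classifier(self.dropout(cls))  # logits over labels

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Breaking: example tweet text"], return_tensors="pt", padding=True)
logits = ViralityClassifier()(batch["input_ids"], batch["attention_mask"])
```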
---
Team 40
- How exactly do your methods improve verifiability specifically?
- Wouldn't training end to end result in better performance? Is there any work suggesting that training on a component level would result in better performance?
- When you mention that your method trains components separately, does the implementation freeze the other modules while training one, or are they trained completely independently? (A small freezing sketch follows this list.)
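A minimal PyTorch sketch of the freezing variant raised in the last question; the two modules are stand-ins, not the team's actual components:

```python
import torch
import torch.nn as nn

# Hypothetical pipeline with two components trained at different times.
retriever = nn.Linear(128, 64)   # stand-in for an evidence-retrieval module
reader = nn.Linear(64, 2)        # stand-in for a reader/verifier module

# Freeze the retriever while training the reader.
for p in retriever.parameters():
    p.requires_grad = False

# Only the reader's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(reader.parameters(), lr=1e-4)
```

Training "completely separately" would instead train each module in its own run with its own objective, never back-propagating through the other.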
---
Team 41
- What does the input look like when using the transformer?
- What kind of prior distribution can be used for distribution of candidate actions?
- Why do you think verb-noun pairs work better than action-effect descriptions?
- What are you trying to predict here? Are you predicting the image based on the action description, or the action based on the image?
- Why is the Hugging Face image search less "noisy"? How does it improve the quality of your dataset?
- Is the performance gain brought by the transformer alone? Are there any other moving parts compared to the original work?
---
Team 42
- Are both of the datasets balanced? What is the distribution of the labels?
- Would it be possible to introduce a scale measure (0-10) to measure the amount of PCL instead of binary classification for more comprehensive evaluation?
- How do you differentiate condescending/patronizing language between different cultures? A sentence said in one culture may be fine, but in the other, the same sentence may be interpreted as condescending/patronizing.
- Why is RoBERTa's result much worse than BERT's? Any insight?
- In the plan section, you mentioned that the TalkDown dataset has a balanced version and an imbalanced version. Why do you want to join those two versions to make a 1:3 (positive : negative) version?
- What are some of the benefits and usages of this patronizing language model? How do you know what is considered patronizing or not?
- What are the differences between the TalkDown! and Don't Patronize Me! datasets. How do you expect the performance to differ in both these datasets?
- What is the main contribution of your paper (other than just applying BERT-family models)?
- How do you join the two versions of the TalkDown dataset to make the 1:3 split, and why should it be 1:3? (A class-weighting sketch for such an imbalance follows this list.)
- Can RNNs and LSTMs achieve the same performance as large models like RoBERTa and BERT? Is there any strategy to check whether the model is overfitting on a dataset that is relatively small?
- Do you think random guessing is a meaningful baseline? It will obviously have very bad performance compared to the learned models, so I'm not sure if it provides much context regarding how effective your learned models are
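One common way to handle a 1:3 positive:negative split is a class-weighted loss rather than resampling. A minimal PyTorch sketch with inverse-frequency weights; the weighting scheme and toy tensors are assumptions, not the team's choice:

```python
import torch
import torch.nn as nn

# Hypothetical label distribution: 1 positive for every 3 negatives.
class_counts = torch.tensor([3.0, 1.0])            # [negative, positive]
weights = class_counts.sum() / (2 * class_counts)  # up-weights the rarer class
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)                         # stand-in model outputs
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
```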
---
Team 43
- Which pre-trained model did you find performed better in your project?
- Does using loss functions other than the Huber loss for this task have any positive or negative impact on the results? (A small Huber-vs-MSE sketch follows this list.)
- Could you tell more about the CoppeliaSim software that is used to render the environment, does it use text input or coordinates to work?
- How significant is initialization for better convergence speed in this case?
- I did not see many details for behavior cloning part. Could you elaborate more on this?
- How much do you think including color variation will affect the accuracy? How generalizable is your model in terms of improving performances on different kinds of robotic tasks?
- How are the actions and the commands related?
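To make the Huber-loss question concrete, a tiny PyTorch comparison with MSE; the delta value and the toy targets are arbitrary:

```python
import torch
import torch.nn as nn

pred = torch.tensor([0.2, 1.0, 4.0])
target = torch.tensor([0.0, 1.0, 0.0])   # the last point acts as an "outlier"

mse = nn.MSELoss()(pred, target)
# Huber behaves like MSE for small errors and like L1 beyond delta,
# so the outlier contributes less to the loss and its gradient.
huber = nn.HuberLoss(delta=1.0)(pred, target)
print(mse.item(), huber.item())
```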
---
Team 44
- What would be candidates for "additional features?" Any thoughts/ideas at this point?
- What is the reasoning for using CNNs instead of transformers?
- Would it be possible to check for internal consistency within the news article? As in, could it detect hypocrisy?
- Fake news detection: this project explores the credibility and representativeness of headlines and body text. Is this a legitimate perspective for judging whether a news article is fake? It seems more like clickbait detection. Also, how does this perspective differ from approaches that classify the truthfulness of a single article on its own?
---
Team 45
- After classifying the sentences, is the eventual goal to provide a summary of the policy across all relevant terms?
- Did you label all the dataset by yourself?
- There could be a sparse-data problem with 48 classes. How do you address this problem?
- I find this project very interesting and unique. My question is: do other BERT-based models (other than DistilBERT) give good results? Thanks.
- Why do you think BERT can handle this problem? (Is there any previous work using BERT?)
- Which models are used by state-of-the-art papers in this field? Is DistilBERT the most commonly used model, and why did you settle on DistilBERT?
- Have you considered testing different models to see if they have a better accuracy?
- It seems that the amount of data may be insufficient. Will the data size be an obstacle for your project? (If yes, how would you handle it? If no, why wouldn't it be an issue?)
- Did you create the dataset yourself? What kind of cookie information are you focusing on and are they all of the same kind i.e. those cookies which are stored even when declined.
- Is there a reason DistilBERT was used? Is it because it is a lightweight transformer model? (A minimal DistilBERT setup sketch follows this list.)
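A minimal sketch of setting DistilBERT up for the 48-class sentence classification discussed above, via Hugging Face. The example sentence is invented, and the training details are omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=48  # 48 policy categories
)

# Hypothetical privacy-policy sentence to classify.
sentence = "We share cookie data with third-party advertisers."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
```

The classification head is randomly initialized at this point; fine-tuning on labeled sentences is still required before the predicted class is meaningful.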
---
Team 46
- How do you plan to do the data augmentation?
- What kind of data will you augment with?
- How are you able to detect the errors in the summaries?
---
Team 47
- Are the current evaluation metrics sufficient? If not, what should be improved?
- How did you create the web interface?
- What does the prediction probability for each word refer to? (A prediction of what?)
- How much difference is there when using random portions for ablation vs selective portions?
---
Team 48
- I am curious how the second level of the ensemble classifier works. Thanks!
- What preprocessing method is used to reduce noise in the dataset?
- Is there a way to estimate spam emojis from bot accounts in the Twitter dataset, since they wouldn't be representative of actual human interaction?
- how do you ensure the quality of your dataset, both for training and testing?
- Are there any emojis that are used in completely different contexts between cultures (and thus between languages)? To what degree might your model account for this?
- Any reason why you chose the ensemble model?
---
Team 49
- It seems like the reddit joke dataset is mostly in the format of setup + punchline -- do you think you could extend your model to generate jokes with more diverse formats?
- Is funny/unfunny annotation overly subjective? Different people may give different responses. How do you quantify the humor level?
- I'm curious -- how was the dataset you're using constructed? What strategy was used to label jokes as either funny or unfunny? Did it use the number of upvotes on the Reddit post, or perhaps some majority vote method of human annotators? I wonder to what degree the latent subjectivity of humor may influence the construction of a dataset on this topic and thus bias your model.
- How do you interpret the BLEU score?
- What is the method for classifying humour as 0 or 1 (discrete)?
- How large is your dataset? The prompt seems very interesting. I am wondering how human annotators were able to assign a continuous humour value.
---
Team 5
- Does your model consider which source the article came from when generating headlines? If not, do you notice different results between different sources?
- Would training on general publications and evaluating on specialized ones inevitably lead to a stylistic clash in generated headlines?
- Why did you use T5, Pegasus, and Pegasus-X rather than other transformers?
- Do you think taking 2016 articles as the training set and fine-tuning the model on the 2016 dataset is reasonable?
- How do you capture temporal drift and evaluate whether drift has occurred?
- Do you plan to implement any extractive summarization approaches for comparison?
- I assume that the drift in time introduces new terminology between training and testing (such as "COVID-19"). Has this produced issues for your model?
- How are the human evaluations performed?
- Did you see bias in your generation based on the training data? (i.e. articles have implicit "leans" or biases that could carry over to downstream tasks)
- How do you plan on efficiently performing human evaluation? Will it just be you two or will you somehow be crowdsourcing evaluations?
- How are you evaluating whether a headline is valid in human evaluation? Isn't there a limit on how many headlines you can evaluate, as well as subjectivity issues?
- On the previous-work slide, you mentioned vanilla RNNs and LSTMs. What does it mean to use both of them? Aren't their methodologies almost the same? Did they perform well in that research?
- For the temporal drift, is it mostly the shift in topics between years that causes the problem, or can the way language changes over time affect the model as well? Maybe not enough time passes for this to occur.
- How will human evaluation be executed effectively?
- How exactly did you finetune the large pretrained transformer models?
- Why do evaluations of news titles degrade year by year and how can you tell if this is caused by temporal drift or not?
- Any reason why you do not use BLEU? It measures precision and is also a very common metric for text summarization.
- Why do ROUGE metrics not properly reflect the performance of the model? Are there alternative quantitative methods for model performance?
- That's a really interesting idea of iteratively training on new years. Were there any particular areas where you noticed it performing worse as time went on? Do you think you would notice bigger differences if you jumped straight from the oldest year to the most recent year?
- How do you plan to perform human evaluation of the model?
- Can you say more about the human evaluation? Also, how do you see this product being used?
- I notice that you said the titles are a balancing act between information and interest. How do you plan to specify what information is necessary in the news, and how interesting the headlines should be?
- Are there any visible topic shifts in the news in your training set?
- How do you choose the most appropriate pre-trained model?
- Have you tried metrics other than ROUGE, given its flaws?
- Given that different news outlets have different biases (for example, CNN usually leans left), the model you trained on the CNN news dataset may carry some bias. When you test on other outlets that lean right, it may be less likely to generate an appropriate news title. How do you intend to handle such bias?
- Since ROUGE is known to be flawed, which alternative metrics could you use for your task instead of ROUGE? (A minimal ROUGE computation sketch follows this list, which also illustrates why lexical-overlap metrics can fall short.)
- How do you think your work differs from other similar ones?
- How would you use GPT3?
- What human evaluation methods do you plan to use to test your methodology?
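Since several questions above concern ROUGE and its limitations, here is a minimal sketch of computing it with the `rouge-score` package. The reference and generated headlines are invented examples, not from the team's data:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Stocks fall as inflation fears grow"          # hypothetical gold headline
generated = "Markets drop amid rising inflation worries"   # hypothetical model output

scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```

Because ROUGE counts n-gram overlap, the two headlines above score near zero even though they convey the same information, which is one reason purely lexical metrics can understate headline quality and why semantic or human evaluation is often added.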
---
Team 50
- What hyperparameters did you try? And do you think you can improve the results just by tuning hyperparameters? Why?
- Why do you pick PIQA as one downstream task? I wonder whether it is the type difference between two tasks that causes the poor performance on PIQA dataset.
---
Team 51
- Have you tried this State tracking problem with Chat-GPT that was recently launched by OpenAI? It would be interesting to see if it runs into similar issues.
- How do recent advances in dialogue generation, such as those seen in the recent release of ChatGPT, handle the problem of dialogue state tracking? These models, in my experience, seem to handle the problem effectively -- I'd be curious to see if you can compare your results to such bleeding-edge advancements as well.
- What is the major difference between GPT-2 and GPT-NeoX?
- Do you think ChatGPT will solve the dialogue generation problem?
- What's the reason for using GPT-NeoX instead of GPT-2 and for using a new dataset?
- Why does your group believe GPT-NeoX will perform better on your task?
- Is it possible to train GPT-NeoX on Great Lakes given the computational constraints of the class account?
---
Team 52
- Interesting topic, but do you have some thoughts on how this would be practical? What comes to mind personally is that improving adversarial robustness may lead to more consistent models with a wider range of vocabulary.
- What language models are you planning to use?
- What type of adversarial attack has been used?
- what is the larger application of adversarial attacks and defense?
- What are some methods you have considered to improve model performance against adversarial attacks?
- What would you like to learn from applying the adversarial attack? Beyond the attack itself, there must be some reason for doing it; what is it?
- Why is swapping synonyms effective for adversarial attacks? It seems like it shouldn't have much effect. (A crude synonym-swap sketch follows this list.)
- I think the data augmentation may have caused overfitting to your model. Have you checked if the distributions of train and test datasets are the same?
- Will you try more datasets to see if your findings are consistent across different datasets?
- Why would applying attack constraints lower the test accuracy?
- You mentioned "Combinations of multiple existing attacks", could you give some examples that you want to try?
---
Team 53
- What evaluation metrics do you plan to use? In your opinion, are these metrics sufficient; in other words, can they adequately reflect the machine's ability on this task?
- What kind of hyper parameters are you planning on testing?
- Is there any reason for not using a BERT model?
---
Team 54
- How do you choose these models?
- What is the motivation for adding back-translation (and what is it, anyway)? (A minimal back-translation sketch follows this list.)
- You mention "Further explore the effects that data augmentation has in addition to its impact on accuracy." Could you give some possible effects?
---
Team 55
- How do your results compare to previous results?
- I am a movie geek myself. My question is: does using graph methods for your task provide significantly better results than generic transformer-based language models like GPT-3, and if so, how? I am curious :P Thanks!
---
Team 56
- Can we use transfer learning to speed up the training process in this case?
---
Team 57
- Is there a metric other than BLEU that takes into account important aspects of music lyrics, e.g. rhyming, meter, etc.? (A crude rhyme-rate sketch follows this list.)
- Do you have any idea of improvement?
- evaluation can be quite challenging. Is BLEU sufficient?
- How does your song lyric generation help prevent plagiarism in music?
- What dataset did you use and how was it generated, and how well would such a model generalize to other artists/genres?
- Will there be issues to generate song lyrics for indie labels/less known singers due to lack of data? And if so, is there a way to take this imbalance in the dataset into account?
- What kind of input does the GPT2 model take, is it old lyrics? Is there a chance of the GPT2 overfitting due to lack of data?
- How can you determine whether the output text sounds like lyrics? Is there any method?
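One possible complement to BLEU raised above is a rhyme-rate heuristic. The sketch below simply compares the endings of consecutive lines' final words; it is an illustration of the idea, not an established metric or the team's evaluation:

```python
def rhyme_rate(lyrics: str, suffix_len: int = 2) -> float:
    """Fraction of consecutive line pairs whose final words share an ending (crude heuristic)."""
    lines = [l.strip().lower() for l in lyrics.splitlines() if l.strip()]
    last_words = [l.split()[-1].strip(".,!?") for l in lines]
    if len(last_words) < 2:
        return 0.0
    pairs = zip(last_words, last_words[1:])
    hits = sum(a[-suffix_len:] == b[-suffix_len:] for a, b in pairs)
    return hits / (len(last_words) - 1)

sample = "I keep it real on the mic\nEvery single day and night\nNothing else feels right"
print(rhyme_rate(sample))  # 0.5: only the last pair shares an ending
```

A phoneme-based check (e.g. via a pronunciation dictionary) would be more faithful to how rhyme actually works, since spelling endings and sounds often disagree.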
---
Team 58
- I see that the validation set had much higher accuracy than the test set (on the baseline, as far as I recall). My question is: why would that be, and does it impact evaluations such as precision, recall, etc., and model inference performance? Thanks!
---
Team 59
- What is task switching?
- How much training data is needed?
- Do you plan on testing the newest chat bot? Do you think it'll yield better results?
- Are there arguments against the inflexibility of lifelong learning models, and is it possible to create a lifelong learning model that is just as flexible as a new model?
- How does the switching period affect the perplexity?
- What is the possible reason why the perplexity of culture_and_art (the first legend entry) is so high in the graph?
- When you switch to a new topic, how's the performance on the previous tasks?
- How does your chosen model allow for better task switching / multi-modal learning?
- Can you provide a technical analysis of the method used?
- Could you explain more on why the model can do task switching?
- What is your approach and how is it connected to the previous work?
- Why was perplexity chosen as the metric? (A short sketch of how perplexity is computed follows this list.)
- There is a set of standard metrics for continual learning/catastrophic forgetting (there should be a survey paper on that). Which metrics are relevant here and which are not?
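Since perplexity comes up repeatedly above, a short sketch of how it is typically derived from a language model's cross-entropy loss; the GPT-2 checkpoint and the sentence are placeholders, not the team's model or data:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Lifelong learning systems must adapt to new topics over time."  # placeholder
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    # With labels=input_ids, the model returns the mean token-level cross-entropy.
    loss = model(ids, labels=ids).loss
perplexity = math.exp(loss.item())
print(perplexity)
```

Lower perplexity on held-out text from a previously seen topic is one way to check how much the model has forgotten after switching tasks.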
---
Team 6
- How to evaluate Rap generation? Are the current metrics sufficient? Why or Why not?
- What do you mean by song segmentation?
- Do you think the newest released ChatGPT can improve your model? It seems very powerful.
- What are the core differences between GPT 3 and GPT J?
- What are the benefits/drawbacks of GPT-J over GPT-3?
- What exactly was the issue with GPT-3? Lack of documentation, or not being able to get a token from OpenAI?
- Which dataset is being used? Could you provide more information about the dataset in terms of its distribution and statistics?
- Are there any constraints on the output of the model to ensure that the flow/lyrics make sense?
- What are the implications of AI-generated lyrics? Do you think the industry will pivot to them, or use a mixture of real artists and AI-generated lyrics?
- How do you think your performance compares to the most recent ChatGPT?
- What makes GPT-3 hard to fine-tune with your dataset? Does the form of the dataset (songs) matter, or is the structure of GPT-3 simply too complicated?
- Can you provide some quantitative results? Or how would you quantify your results?
- How could you enforce more rhyming in the generation?
- It's a very cool project! I just wonder how you get the training data? Crawled from the Internet?
- What evaluation methods do you think can best measure your model's performance?
- Are there specific features of rap (e.g. rhyme, thematic coherence, etc.) that you are using for evaluation?
- Are you going to use any metrics, other than the human evaluation, to measure the performance of your model?
- How is the human evaluation for your project carried out?
- What basis (features) should human evaluators look for when evaluating raps?
- How will you fine-tune your model? (A stripped-down fine-tuning sketch follows this list.)
- Can you give several intuitive examples of AI generated rap?
- Were your results surprising to you when it comes to generating rap?
- Did having a corpus spanning so many different artists slightly blend lyrical styles in the final output? Especially with some tokens being so specific to certain artists
- Could you provide some possible explanations on why the results of single artist and multiple artist differ?
- What was the reason for fine-tuning GPT-2/GPT-J versus something like BERT? Is there a transformer that is better suited specifically for lyric generation?
- I'm curious how you are going to get human evaluations, I guess the evaluation of lyrics is very subjective, do you have some criteria set?
- How do you do human evaluation? Do you recruit other people or do it yourselves? What are the metrics?
- A lot of the time, part of a rapper's sound is pronouncing words in certain ways that can't be captured just by reading the lyrics. Do you think that would affect how the AI-generated lyrics sound, and whether they still sound like the artist?
- Can you explain how you avoid overfitting?
- How long did the fine-tuning take in total, have you considered also evaluating the model on ROUGE?
- The sentences of the lyrics seem to have too many similarities; have you checked that?
- The results are somewhat interesting to me, given that the sentence patterns and song structure of a particular artist like Drake are similar, so I doubt whether it could generate genuinely new "sounds".
- Where do you think the strength of your work lies?
- What dataset(s) are being used to train/fine-tune this model? Would the same training methodology scale up to multilingual rap generation too?
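For the fine-tuning questions above, a stripped-down sketch of causal-LM fine-tuning of GPT-2 on lyric-like text. The toy corpus, hyperparameters, single-example batches, and sampling settings are assumptions, not the team's setup:

```python
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=5e-5)

# Toy stand-in for a lyrics corpus (one verse per string).
corpus = [
    "placeholder verse one about the city lights",
    "placeholder verse two about the midnight drive",
]

model.train()
for epoch in range(1):
    for verse in corpus:
        batch = tokenizer(verse, return_tensors="pt")
        # For causal-LM fine-tuning, the labels are the input ids themselves.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Generate a continuation from a prompt after fine-tuning.
prompt = tokenizer("placeholder verse", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20, do_sample=True, top_p=0.9,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```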
---
Team 60
- How does method 1 address the big problem mentioned on slide 6?
- Which Q-A demonstrations produced the best results? Did changing the Q-A prompting change the results a lot?
- Why do you suspect the accuracy went down while the consistency went up? How are the model performances on verifiability?
- Are you also trying to improve the verifiability or is the goal just to improve the plausibility?
- Can you explain again what the chain-like symbolic operations look like and how they are used to enforce consistency?
- What loss functions do you plan to use after the preliminary one? Thanks!
---
Team 61
- Will there be a baseline for your experiment?
- What training dataset will you choose?
- What makes convBERT have a lower memory footprint than BERT?
- How does the reduction in computation cost increase prediction accuracy?
- What is the coherence metric?
- What do you hope to conclude from the coherence metric for each model?
- Are there other labels besides entailment = T/F? And how does performance compare across these classes if there are more than one?
- What's the biggest difference between your work and the previous work?
- Do you simply use a new pre-trained model to do predictions? Do you plan to implement other methods?
- What is the intuition or possible reason why adding a convolution layer outperforms the linear layer in the transformer-based model?
- How do the convolutions in ConvBERT change the way that the model learns entailment?
- On the results slide, why was there better performance for the last model? Is there any specific reason?
- What was the reason for choosing ConvBERT? Did you consider using any other BERT variations? (A minimal ConvBERT setup sketch follows this list.)
- How much would you expect accuracy to increase after finetuning? So far results seem to be around 50%.
- Are you considering improving coherence?
- Do you have an idea as to why ConvBERT performed better than the others? Is there a particular advantage to this method?
- Could you explain why ConvBERT gives better results even though it is more cost-efficient?
- Why do convolutions seem to perform better than matrices (ConvBERT vs BERT) on the entailment problem?
- Why is the accuracy low? Did you fine-tune the model?
- How much memory and computation does ConvBERT cost compared to the base models?
- Have you checked the balance of the dataset? Or tried other binary metrics like MCC?
- Was the model trained or loaded from the pre-trained model pipeline from huggingface? If it was trained, what were the hyperparameters that were used?
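A minimal sketch of loading a ConvBERT checkpoint for two-way entailment classification via Hugging Face. The checkpoint name `YituTech/conv-bert-base`, the label mapping, and the premise/hypothesis pair are assumptions, not necessarily what the team used:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint; substitute whichever ConvBERT variant was actually used.
name = "YituTech/conv-bert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

premise = "A man is playing a guitar on stage."
hypothesis = "A musician is performing."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# Which index means "entailment" depends on how the labels were encoded during fine-tuning.
print("entailment" if logits.argmax(-1).item() == 1 else "not entailment")
```

Without fine-tuning, the freshly initialized classification head will sit near chance, which is consistent with the roughly 50% accuracies mentioned above.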
---
Team 7
- What are the datasets you used for pretraining and downstream tasks?
- What do you mean by fine-tuning the dataset?
- It seems the given dataset is small (140 verb-noun pairs). Will you decide to work with a larger dataset?
- Is it expected that the gap between top-5 and top-10 accuracy is much smaller than the gap between top-1 and top-5?
- What do you think about your results compare to word vector arithmetic (i.e. king - man + woman = queen) then do image generation?
- Why is there a huge gap between top-1 and top-5 accuracy, while there is only a small gap between top-5 and top-10 accuracy?
- How does Clip-Gen work to visualize your implementation?
- How does the model differentiate synonyms like "cut apple" vs "slice apple"?
- Why do you prefer a zero-shot model? Could you explain more on why zero-shot is better for this task?
- How did you pick your pre-trained models?
- What are the main takeaways from the figures on slides 8 and 10? What is the optimal pattern?
- What do Top {1, 5, 10} mean in your results? (A sketch of top-k matching with CLIP follows this list.)
- What is a baseline accuracy for this task, i.e. human performance?
- I am just curious how you combine the image and text when prompting. Is there a masking strategy or special encoding strategy to specify whether an embedding is an image encoding or a token encoding?
- In the effect prediction section, for the task "Given an action in the form of a verb-noun pair, identify images that best matches the effect", what is the number of the candidate images?
- Have you checked the balance of the dataset? Or tried other binary metrics like MCC?
- How would you finetune your model using different prompts? Would you be using the entire dataset to finetune the model?
- When using CLIP, have you noticed differences in alignment between the various datapoints?
- Did you manually create the dataset?
- What's your innovation?
- Have you considered the effect of the singular vs. plural form of the noun on the results?
- I didn't really understand this in the slides: the distance of (<dog as text>, <house as text>) is not guaranteed to make sense even if the distance of (<dog as text>, <dog as image>) is minimized. For the paper, maybe try to explain this more?
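To make the Top-{1, 5, 10} discussion concrete, a sketch of zero-shot text-to-image matching with CLIP plus a top-k hit check. The image paths, candidate count, gold index, and verb-noun phrase are placeholders, not the team's data:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder candidate effect images and one action description.
images = [Image.open(p) for p in ["img0.jpg", "img1.jpg", "img2.jpg"]]
action = "cut apple"

inputs = processor(text=[action], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_text = model(**inputs).logits_per_text  # shape (1, num_images)

k = 2
topk = logits_per_text[0].topk(k).indices.tolist()
gold_index = 0  # index of the true effect image (placeholder)
print(f"top-{k} hit" if gold_index in topk else "miss")
```

Top-k accuracy is then the fraction of action descriptions whose true effect image appears among the k highest-scoring candidates; a large top-1 vs. top-5 gap means the right image is often ranked close to, but not at, the top.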
---
Team 8
- Does the prediction only work for explicit expressions in the story text, or does it also work for implicit expressions or metaphor?
- For TRIP - does conflict detection mean that they understand the task overall or can just detect words that have conflicting states?
- Why did previous methods have low verifiability?
- What is the difference between "verifiability" and interpretability?
- How were the authors able to evaluate the verifiability?
- I guess using a linear layer for the ensemble, or simply a moving average of a single model, could help.
- How is TRIP different from Natural Language Inference tasks?
- Why do you want to train different models instead of sampling from the same datasets and use one model to train?
- What kinds of ensemble methods are you planning to use other than majority vote? Is there any reasoning behind your particular choices? (A minimal majority-vote sketch follows this list.)
- Since the overall performance is pretty good, how do you (or authors) know that model has very low verifiability?
- What are the core differences between these 3 BERT models and how could they help differently to 3 tasks in TRIP?
- Why do you think the averaged output of three models will be better than fine-tuning one model for the specific task using training data? What is the difference between the three models you are using? Would three sets of attention heads in one model achieve the same result?
- Would your model work better if you tried another BERT such as RoBERTa?
- Is there any added benefit in verifiability or explainability with the ensemble method? If not, are there any good ways to address this?
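For the ensemble questions, a small sketch of a majority vote over three models' binary predictions; the prediction arrays are made up, and averaging logits (soft voting) is the usual alternative:

```python
import numpy as np

# Hypothetical per-example 0/1 predictions from three fine-tuned models.
preds = np.array([
    [1, 0, 1, 1, 0],   # model A
    [1, 1, 1, 0, 0],   # model B
    [0, 0, 1, 1, 1],   # model C
])

# Majority vote: an example is labelled 1 if at least two of the three models say 1.
majority = (preds.sum(axis=0) >= 2).astype(int)
print(majority)   # [1 0 1 1 0]
```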
---
Team 9
- How did you manually qualitatively code so many pages? It sounds like it would take a really really long time.
- Which binary classifier are you using (pre-made or a specific architecture)?
- Could you expand a little more on what it means to identify value from your dataset? How did you go about doing that and determining what parts are considered value as opposed to others?
- Did you divide each page into individual tuples for the dataset? Was it divided properly? I think there might be some issues with discontinuous sentences.
- Which is more important for your specific task, precision or recall? (or equal), and why?
- Did you consider using a more domain specific language model in addition to BERT, was there a specific reason for BERT other than it being a popular choice?
- What led you to pick the BERT model as opposed to some of the other ones that are available?
- Will accuracy, precision, or recall be the most important measure for considering a success for your project?
- Which words do you suspect will be driving attention labels?
- For what application area can this classification method actually be used?
- What's the most challenging part of using a custom dataset?
- How many instances are in your dataset? Is it only "value" / "no value" classification? The classification seems subjective. Do you plan to have multiple people annotate a small dataset and check their agreement?
- How do you know whether your results are good? Do you have any baseline to compare with?
- How long did it take you to make your custom dataset?
- Do you think certain words would have much more weight in driving the result of the classifier, and why do you think so? Is there any way for you to figure out which words might have much more impact in the classification result?
- Would your model work better had you used another BERT such as RoBERTa?
- Why do you use BERT? What separates it from other pre-trained language models in your project?
- Have you considered incorporating k-grams into your model?
- How do you calculate your accuracy?
- Could I get your thoughts on which values you plan to examine and use in the future?
---